Last year, the CONCUR conference series inaugurated its Test-of-Time Award, whose purpose is to recognise important achievements in Concurrency Theory that were published at the CONCUR conference and have stood the test of time. This year, the following four papers were chosen to receive the CONCUR Test-of-Time Awards for the periods 1994–1997 and 1996–1999 by a jury consisting of Rob van Glabbeek (chair), Luca de Alfaro, Nathalie Bertrand, Catuscia Palamidessi, and Nobuko Yoshida:
Last year, I interviewed the CONCUR 2020 Test-of-Time Award recipients and was asked by Javier Esparza (chair of the CONCUR SC) and Ilaria Castellani (outgoing chair of the IFIP WG 1.8 on Concurrency Theory) to do the same with the current batch of awardees. (In passing, let me thank Nathalie Bertrand and Nobuko Yoshida for their kind help with the interviews!)
This post is devoted to the interview I conducted via email with Rajeev Alur, Thomas A. Henzinger, Orna Kupferman and Moshe Y. Vardi. Reading the answers I received from that dream team of colleagues was like a masterclass for me and I trust that their thoughts on their award-winning paper will be of interest to many of the readers of this blog. Enjoy!
Luca: You receive the CONCUR ToT Award 2021 for your paper Alternating Refinement Relations, which appeared at CONCUR 1998. In that article, you gave what I consider to be a fundamental contribution, namely the introduction of refinement relations for alternating transition systems. Could you briefly explain to our readers what alternating transition systems are? Could you also tell us how you came to study the question addressed in your award-winning article and why you focused on simulation- and trace-based refinement relations? Which of the results in your paper did you find most surprising or challenging?
Answer: When we model a system by a graph, our model abstracts away some details of the system. In particular, even when systems are deterministic, states in the model may have several successors. The nondeterminism introduced in the model often corresponds to different actions taken by the system when it responds to different inputs from its environment. Indeed, a transition in a graph that models a composite system corresponds to a step of the system that may involve some components. Alternating transition systems (ATSs) enable us to model composite systems in more detail. In an ATS, each transition corresponds to a possible move in a game between the components, which are called agents. In each move of the game, all agents choose actions, and the successor state is deterministically determined by all actions. Consequently, ATSs can distinguish between collaborative and adversarial relationships among components in a composite system. For example, the environment is typically viewed adversarially, meaning that a component may be required to meet its specification no matter how the environment behaves.
In an earlier paper, some of us introduced ATSs and Alternating Temporal Logics, which can specify properties of agents in a composite system. The CONCUR 1998 paper provided refinement relations between ATSs which correspond to alternating temporal logics. Refinement is a central issue in a formal approach to the design and analysis of reactive systems. The relation “I refines S” intuitively means that system S has more behaviors than system I. It is useful to think of S as a specification and I as an implementation. Now, if we consider a composite implementation I||E and specification S||E and we want to check that the component I refines the component S, then the traditional refinement preorders are inappropriate, as they allow I to achieve refinement of I||E with respect to S||E by constraining its environment E. Alternating refinement relations are defined with respect to ATSs that model the interaction among the underlying components, and they enable us to check, for example, that component I has fewer behaviors than component S no matter how component E behaves. They are called “alternating” because refinement may restrict implementation actions but must not restrict environment actions. In other words, refinement may admit fewer system actions but, at the same time, more environment actions.
It was nice to see how theoretical properties of preorders in the traditional setting are carried over to the game setting, and so are the results known then about the computational price of moving to a game setting. First, the efficiency of the local preorder of simulation with respect to the global preorder of trace containment is maintained. As in the traditional setting, alternating simulation can be checked in polynomial time, whereas alternating trace-containment is much more complex. Second, the branching vs. linear characterizations of the two preorders are preserved: alternating simulation implies alternating trace containment, and the logical characterization of simulation and trace-containment by CTL and LTL, respectively, is carried over to their alternating temporal logics counterparts. The doubly-exponential complexity of alternating trace containment, as opposed to the PSPACE complexity of trace containment, is nicely related to the doubly-exponential complexity of LTL synthesis, as opposed to its PSPACE model-checking complexity.
Luca: In your paper, you give logical characterisations of your alternating refinement relations in terms of fragments of alternating temporal logic. Logical characterisations of refinement relations are classic results in our field and I find them very satisfying. Since I teach a number of those results in my courses, I'd be interested in hearing how you would motivate their interest and usefulness to a student or a colleague. What would your "sales pitch" be?
Answer: There is extensive research on the expressive power of different formalisms. Logical characterization of refinement relations tells us something about the distinguishing power of formalisms. For example, while the temporal logic CTL* is more expressive than the temporal logic CTL, the two logics have the same distinguishing power: if you have two systems and can distinguish between them with a CTL* formula (that is, your formula is satisfied only in one of the systems), then you should be able to distinguish between the two systems also with a CTL formula. Moreover, while CTL is not more expressive than LTL, we know that CTL is “more distinguishing” than LTL. These results have to do with the logical characterizations of trace containment and simulation. The distinguishing power of a specification formalism is useful when we compare systems, in particular an implementation and its abstraction: if we know that the properties we care about are specified in some formalism L, and our system refines the abstraction according to a refinement relation in which the satisfaction of specifications in L is preserved, then we can perform verification on the abstraction.
Luca: I am interested in how research collaborations start, as I like to tell "research-life stories" to PhD students and young researchers of all ages. Could you tell us how you started your collaboration on the award-winning paper?
Answer: Subsets of us were already collaborating on other topics related to reactive models and model checking, and all of us shared a common belief that the field needed to move from the limited setting of closed systems to a more general setting of open systems, that is, systems that interact with an environment. Open systems occur not only when the environment is fully or partly unknown, but also when a closed system is decomposed into multiple components, each of them representing an open system. To build “openness” into models and specifications as first-class citizens quickly leads to the game-theoretic (or “alternating”) setting. It was this realization and the joint wish to provide a principled and systematic foundation for the modeling and verification of open systems which naturally led to this collaboration.
Luca: Did any of your subsequent research build explicitly on the results and the techniques you developed in your award-winning paper? Which of your subsequent results on alternating transition systems and their refinement relations do you like best? Is there any result obtained by other researchers that builds on your work and that you like in particular or found surprising?
Answer: Various subsets of us pursued multiple research directions that developed the game-theoretic setting for modeling and verification further, and much remains to be done. Here are two examples. First, the game-theoretic setting and the alternating nature of inputs and outputs are now generally accepted as providing the proper semantic foundation for interface and contract formalisms for component-based design. Second, studying strategic behavior in multi-player games quickly leads to the importance of probabilistic behavior, say in the form of randomized decisions and strategies, of equilibria, when players have non-complementary objectives, and of auctions, when players need to spend resources for decisions. All of these are still very active topics of research in computer-aided verification, and they also form a bridge to the algorithmic game theory community.
Luca: One can view your work as a bridge between concurrency theory and multi-agent systems. What impact do you think that your work has had on the multi-agent-system community? And what has our community learnt from the work done in the field of multi-agent systems? To your mind, what are the main differences and points of contact in the work done within those communities?
Answer: Modeling interaction in multi-agent systems is of natural interest to planning problems studied in the AI community. In 2002, the International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS) was formed and the annual International Conference on Autonomous Agents and Multiagent Systems (AAMAS) was launched. The models, logics, and algorithms developed in the concurrency and formal methods communities have had a strong influence on research presented at AAMAS conferences over the past twenty years. Coincidentally, this year our paper on Alternating-Time Temporal Logic was chosen for the IFAAMAS Influential Paper Award.
Luca: What are the research topics that you find most interesting right now? Is there any specific problem in your current field of interest that you'd like to see solved?
Answer: Research on formal verification and synthesis, including our paper, assumes that the model of the system is known. Reinforcement learning has emerged as a promising approach to the design of policies in scenarios where the model is not known and has to be learned by agents by exploration. This leads to an opportunity for research at the intersection of reactive synthesis and reinforcement learning. A potentially promising direction is to consider reinforcement learning for systems with multiple agents with both cooperative and adversarial interactions.
The realization that reactive systems have to satisfy their specifications in all environments has led to extensive research relating formal methods with game theory. Our paper added alternation to refinement relations. The transition from one to multiple players has been studied in computer science in several other contexts. For the basic problem of reachability in graphs, it amounts to moving from reachability to alternating reachability. We recently studied this shift in other fundamental graph problems, like the generation of weighted spanning trees, flows in networks, vertex covers, and more. In all these extensions, we consider a game between two players that take turns in jointly generating the outcome. One player aims at maximizing the value of the outcome (e.g., maximize the weight of the spanning tree, the amount of flow that travels in the network, or the size of the vertex cover), whereas the second aims at minimizing the value. It is interesting to see how some fundamental properties of graph algorithms are lost in the alternating setting. For example, following a greedy strategy is not beneficial in alternating spanning trees, optimal strategies in alternating flow networks may use fractional flows, and while the vertex-cover problem is NP-complete, an optimal strategy for the maximizer player can be found in polynomial time. Many more questions in this setting are still open.
Luca: What advice would you give to a young researcher who is keen to start working on topics related to alternating transition systems and logics?
Answer: One important piece of advice to young researchers is to question the orthodoxy. Sometimes it is necessary to learn everything that is known about a topic but then take a step back, look at the bigger picture, reexamine some of the fundamental assumptions behind the established ways of thinking, change the models that everyone has been using, and go beyond the incremental improvement of previous results. This is particularly true in formal methods, where no single model or approach fits everything. And young researchers stand a much better chance of having a really fresh new thought than those who have been at it for many years.
Most machine learning classes and textbooks mention that there is no universal supervised learning algorithm that can do reasonably well on all learning problems. Indeed, a series of “no free lunch theorems” state that even in a simple input space, for any learning algorithm, there always exists a bad conditional distribution of outputs given inputs where this algorithm performs arbitrarily badly.
For example, from the classic book of Luc Devroye, László Györfi, and Gábor Lugosi [1, Theorem 7.2 and its extensions], for inputs uniformly distributed in \([0,1]\), for any decreasing sequence \((\varepsilon_n)_{n \geqslant 0}\) which is less than 1/16, for any learning algorithm (which takes pairs of observations and outputs a prediction function), there exists a conditional distribution on \(\{-1,1\}\) for which the expected risk of the classifier learned from \(n\) independent and identically distributed observations of (input,output) pairs is greater than \(\varepsilon_n\) for all \(n \geqslant 1\), while the best possible expected error rate is zero.
Such theorems do not imply that all learning methods are equally bad, but rather that all learning methods will suffer from some weaknesses. Throughout this blog we will try to better understand the weaknesses and strengths of popular methods through learning theory.
The key is to control the potential weaknesses of a learning method by making sure that in “favorable” scenarios, it leads to strong guarantees. When taking the simplest example of vectorial inputs in \(\mathbb{R}^d\), we can construct model classes of increasing complexity for which we can start to draw useful comparisons between learning methods.
Several aspects of the joint distribution of (input,output) \((X,Y)\) make the problem easy or hard. For concreteness and simplicity, I will focus on regression problems where the output space is \(\mathcal{Y} = \mathbb{R}\) and with the square loss, so that the optimal function, the “target” function, is \(f^\ast(x) = \mathbb{E}(Y|X=x)\). But much of the discussion extends to classification or even more complex outputs (see, e.g., [1]).
Curse of dimensionality. In the context of vectorial inputs, the slowness of universal learning algorithms can be characterized more precisely, and leads to the classical curse of dimensionality. Assuming only that the optimal target function is Lipschitz-continuous, that is, for all \(x,x'\), \(| f^\ast(x)-f^\ast(x')| \leqslant L \| x - x'\|\) (for an arbitrary norm on \(\mathbb{R}^d\)), the optimal excess risk of a prediction function \(\hat{f}\) obtained from \(n\) observations, that is, the expected squared difference between \(f^\ast(x)\) and \(\hat{f}(x)\), cannot be less than a constant times \(n^{-2/(d+2)}\). Thus, in order for this rate to be smaller than some \(\varepsilon <1 \), we need \(n\) to be larger than \(\displaystyle ( {1}/{\varepsilon} )^{d/2+1}\), which is an exponential dependence on \(d\) (see [2]).
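To get a feel for the numbers, here is a minimal sketch that evaluates the bound \(( 1/\varepsilon )^{d/2+1}\) for a few dimensions. The function name `min_samples` is only illustrative, and the unknown constant in front of the rate is ignored (taken equal to one), so only orders of magnitude are meaningful.

```python
import math


def min_samples(eps: float, d: int) -> float:
    """Order of magnitude of n such that n**(-2/(d+2)) equals eps.

    Assumption: the constant in front of the rate is taken to be 1.
    """
    return (1.0 / eps) ** (d / 2 + 1)


if __name__ == "__main__":
    eps = 0.1
    for d in (1, 2, 10, 20, 50):
        # The required sample size grows exponentially with the dimension d.
        print(f"d = {d:3d}  ->  n of order {min_samples(eps, d):.3g}")
```

The exponent \(d/2+1\) is what makes the requirement blow up: each pair of extra dimensions multiplies the required number of observations by \(1/\varepsilon\).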
In other words, exponentially many observations are needed for a reasonable performance on all problems with a minimal set of assumptions (here Lipschitz-continuity), and this bad behavior is unavoidable unless extra assumptions are added, which we now describe.
Low-dimensional support. If the data occupy only a \(r\)-dimensional subspace of \(\mathbb{R}^d\), with \(r \leqslant d\) (and typically much smaller), then one should expect a better convergence rate. This also extends to data supported on a (smooth) manifold, as illustrated below. Note that this assumption does not concern the outputs, and can reasonably be checked given the data by performing principal component analysis or some form of manifold learning. Essentially, in terms of convergence rates, \(d\) is replaced by \(r\). This is obvious if the learning algorithm has access to the \(r\)-dimensional representation, and it requires more work if not.
Smoothness of the target function. This is done by assuming some bounded derivatives for the target function \(f^\ast\). With bounded \(s\)-th order derivatives, we could expect that the problem is easier (note that Lipschitz-continuity corresponds to \(s=1\)). This is illustrated below with one-dimensional inputs. Essentially, in terms of rates, \(d\) is replaced by \(d/s\) [2, Theorem 3.2]. Therefore, when the smoothness order \(s\) is of order \(d\), the dependence in the dimension disappears.
Latent variables. If we assume that the target function depends only on a \(r\)-dimensional linear projection of the input, then we should expect a better complexity. The most classical example is the dependence on a subset of the \(d\) original variables. Essentially, in terms of rates, \(d\) is replaced by \(r\). This is obvious when the latent variables are known (as we can replace input data by the \(r\) latent variables), but not at all otherwise, as this requires some form of adaptivity (see below).
Need for adaptivity. We typically don’t know in advance whether these properties are satisfied: some are easily testable (e.g., the support of the distribution) without knowing the target function \(f^\ast\), while others are not.
The goal is to have a single method that can adapt to all of these situations (which are non-exclusive). That is, if the problem has any of these reasons to be easy, will the learning method benefit from it? Typically, most learning methods have at least one hyperparameter controlling overfitting (e.g., a regularization parameter), and the precise value of this hyperparameter will depend on the difficulty of the problem. We will assume that we have a reasonable way to estimate this hyperparameter (e.g., cross-validation). A method is then said to be adaptive if, with a well-chosen value of the hyperparameter, we get the optimal (or close to optimal) rate of estimation that benefits from the extra assumption.
Quest for adaptivity: who wins? Among classical learning techniques, which ones are adaptive to which properties? In short, barring computational and optimization issues: $$\mbox{ local averaging } < \mbox{ positive definite kernels } < \mbox{ neural networks }.$$ Every time, the next method in the list gains adaptivity to support, smoothness and then latent variables. Note however that optimization for neural networks is more delicate (see below).
Let’s now briefly look at these methods one by one. I will assume basic knowledge of these; for more details, see [4, 5] or my new book in preparation.
The earliest and simplest learning methods that could adapt to any target functions were local averaging methods aiming at approximating directly the conditional expectation \(f^\ast(x) = \mathbb{E}(Y|X=x)\), with the most classical examples being k-nearest neighbors and Nadaraya-Watson estimators.
These methods are naturally adaptive to having a reduced support for inputs. Indeed, in the simplest case of a distribution supported in a low-dimensional subspace, the global metric is equivalent to a local metric on the support, without any need to explicitly know the subspace, a situation which extends to smooth manifolds. In order to be adaptive, the hyperparameter has to depend on the dimension \(r\) of the manifold, e.g., for \(k\)-nearest-neighbors, \(k\) has to be taken proportional to \(n^{2/(2+r)}\) (see [3]).
However, there is no adaptivity to the smoothness of the target function or to potential latent variables unless dedicated algorithms are used such as local regression. Kernel methods and neural networks lead to such adaptivity.
We consider prediction functions of the form $$f(x) = \sum_{j=1}^m a_j ( b_j^\top x )_+,$$ which is the traditional single hidden layer fully connected neural network with ReLU activation functions (a constant term can be added to the linear term within the ReLU by simply appending \(1\) to \(x\); please do not call these terms “biases” as this has another meaning in statistics).
The vector \(a \in \mathbb{R}^m\) represents output weights, while the matrix \(b \in \mathbb{R}^{m \times d}\) represents input weights.
Empirical risk minimization (ERM). We assume given \(n\) i.i.d. observations \((x_1,y_1),\dots,(x_n,y_n) \in \mathbb{R}^d \times \mathbb{R}\), and we will fit models by minimizing the \(\ell_2\)-regularized empirical risk (you can call it “weight decay” but it already has a better name in machine learning and statistics). That is, we minimize $$R(a,b) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i \, - \sum_{j=1}^m a_j ( b_j^\top x_i )_+ \Big) ^2 + \frac{\lambda}{2} \sum_{j=1}^m \Big\{ a_j^2 + \| b_j\|_2^2 \Big\},$$ where \(\lambda > 0\) is a regularization parameter.
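As a concrete reading of the objective \(R(a,b)\), here is a small NumPy sketch. The function names `predict` and `regularized_risk` and the array layout (one row of `b` per hidden neuron, one row of `X` per observation) are choices made for this illustration only.

```python
import numpy as np


def predict(a, b, X):
    """f(x) = sum_j a_j (b_j^T x)_+ evaluated on each row x of X.

    Shapes: a is (m,), b is (m, d), X is (n, d); the result is (n,).
    """
    hidden = np.maximum(X @ b.T, 0.0)  # ReLU features, shape (n, m)
    return hidden @ a


def regularized_risk(a, b, X, y, lam):
    """R(a, b): half the mean squared error plus the squared l2 penalty."""
    residual = y - predict(a, b, X)
    data_fit = 0.5 * np.mean(residual ** 2)            # (1 / 2n) * sum of squares
    penalty = 0.5 * lam * (np.sum(a ** 2) + np.sum(b ** 2))
    return data_fit + penalty


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 50, 3, 16
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
    a, b = rng.standard_normal(m), rng.standard_normal((m, d))
    print(regularized_risk(a, b, X, y, lam=0.1))
```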
Overparametrization. We will consider the limit of large number \(m\) of hidden neurons. Depending whether we optimize over both input weights \(b\) and output weights \(a\), or simply output weights (with then a proper initialization of the input weights), we get different behaviors.
We assume that the input weights \(b_j \in \mathbb{R}^d\) are sampled uniformly from the Euclidean sphere of radius \(1 / \sqrt{m}\), and that we only optimize over the output weights \(a_j, j = 1,\dots,m\). This is exactly a ridge regression problem (square loss and squared Euclidean penalty), for which the theory of positive definite kernels applies and will lead to an interesting behavior for infinite \(m\) [7, 8]. That is, the solution can be obtained by using the kernel function \(\hat{k}\) defined as $$\hat{k}(x,x') = \sum_{j=1}^m ( b_j^\top x )_+( b_j^\top x' )_+,$$ and looking for prediction functions of the form $$f(x) = \sum_{i=1}^n \alpha_i \hat{k}(x,x_i).$$ See, e.g., [6] for details. In particular, as long as \(\hat{k}(x,x')\) can be computed efficiently, the complexity of solving the ERM problem is independent of the number of neurons \(m\). We will now discuss in further detail the over-parameterized case where \(m=\infty\).
When \(m\) tends to infinity, if the input weights are fixed and initialized randomly from the \(\ell_2\)-sphere of radius \(1/\sqrt{m}\), then by the law of large numbers, \(\hat{k}(x,x')\) tends to $$k(x,x') = \mathbb{E}_{b \sim {\rm uniform} (\mathbb{S}^{d-1}) } (b^\top x )_+ (b^\top x')_+,$$ where \(\mathbb{S}^{d-1}\) is the unit \(\ell_2\)-sphere in \(d\) dimensions. This kernel has a closed form and happens to be equal to (see [9]) $$ \frac{1}{2\pi d} \|x\|_2 \| x'\|_2 \big[ ( \pi \, - \varphi) \cos \varphi + \sin \varphi \big],$$ where \(\cos \varphi = \frac{ x^\top x' }{{ \|x\|_2 } { \| x'\|_2 } }\). It can also be seen (see, e.g., [10] for details) as using predictors of the form $$f(x) = \int_{\mathbb{S}^{d-1}} ( b^\top x)_+ d \mu(b),$$ for some measure \(\mu\) on \(\mathbb{S}^{d-1}\), with the penalty $$\frac{\lambda}{2} \int_{\mathbb{S}^{d-1}} \big| \frac{d\mu}{d\tau}(b) \big|^2 d\tau(b),$$ for the uniform probability measure \(d\tau\) on the hypersphere \(\mathbb{S}^{d-1}\). This representation will be useful for the comparison with neural networks.
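The convergence of \(\hat{k}\) to the closed-form kernel can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and uniform sampling on the sphere is done by normalizing Gaussian vectors (a standard trick, not something prescribed in the text above).

```python
import numpy as np


def relu_kernel_closed_form(x, xp):
    """Limiting kernel: ||x|| ||x'|| [(pi - phi) cos(phi) + sin(phi)] / (2 pi d)."""
    d = x.shape[0]
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    cos_phi = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    phi = np.arccos(cos_phi)
    return nx * nxp * ((np.pi - phi) * cos_phi + np.sin(phi)) / (2 * np.pi * d)


def relu_kernel_random_features(x, xp, m, rng):
    """Finite-m kernel with b_j uniform on the sphere of radius 1/sqrt(m)."""
    d = x.shape[0]
    g = rng.standard_normal((m, d))
    # Normalized Gaussians are uniform on the unit sphere; then rescale.
    b = g / np.linalg.norm(g, axis=1, keepdims=True) / np.sqrt(m)
    return float(np.sum(np.maximum(b @ x, 0.0) * np.maximum(b @ xp, 0.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([1.0, 0.0, 0.0])
    xp = np.array([0.6, 0.8, 0.0])
    print("closed form:", relu_kernel_closed_form(x, xp))
    for m in (10, 1_000, 100_000):
        print(f"m = {m:7d}:", relu_kernel_random_features(x, xp, m, rng))
```

As \(m\) grows, the Monte Carlo value should fluctuate less and less around the closed form, in line with the law of large numbers argument.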
Note that here we use random features in a different way from their common use [8], where the kernel \(k\) comes first and is approximated by \(\hat{k}\) to obtain fast algorithms, typically with \(m < n\). Here we start from \(\hat{k}\) and find the limiting \(k\) to understand the behavior of overparameterization. The resulting function space is a (Sobolev) space of functions on the sphere with all \(s\)-th order derivatives which are square-integrable, with \(s = d/2+3/2\) [10].
Finally, the number of neurons \(m\) needed to reach the kernel regime is well understood, at least in simple situations [15, 16].
We can now look at the various forms of adaptivity of kernel methods based on Sobolev spaces.
Adaptivity to reduced support. Like local averaging techniques, adaptivity to input data supported on a subspace is rather straightforward, and it extends to smooth manifolds (see, e.g., [12] for a proof for the Gaussian kernel).
Adaptive to smoothness. A key attractive feature of kernel methods is that they can circumvent the curse of dimensionality for smooth target functions, and, by simply adjusting the regularization parameter \(\lambda\), ridge regression will typically adapt to the smoothness of the target function, and thus benefit from easy problems.
The simplest instance is when the target function is within the space of functions defined above, where we immediately get estimation rates which are independent of dimension (at least in the exponent). This however requires at least \(s>d/2\) derivatives (because Sobolev spaces are reproducing kernel Hilbert spaces (RKHS) only in this situation), but the adaptivity extends to functions outside of the RKHS (see, e.g., [11]).
Adaptivity to linear latent variables. Unfortunately, kernel methods are not adaptive even to basic linear structures. That is, if \(f^\ast\) depends only on the first component of \(x\), then ridge regression, if not modified, will not take advantage of it. In the context of neural networks, one striking example is that a single neuron \(x \mapsto (b^\top x)_+\) does not belong to the RKHS, and leads to bad estimation rates (see, e.g., [10, 13]).
Before looking at some experiments, let’s look at some optimization considerations, that will be important later for neural networks.
Avoiding overparameterized neuron representations. In the kernel regime, optimizing directly the cost function with \(m\) neurons by gradient descent is problematic for two reasons when \(m\) is large:
Experiments. We first consider a very simple one-dimensional example, where we look at how the estimation with the kernel function \(k\) varies as a function of the hyperparameter \(\lambda\), from underfitting to overfitting. We can observe that all learned functions are smooth, as expected.
We can now compare the convergence rates for the excess risk (expected squared distance between \(f\) and \(f^\ast\)). We can see that the rates are better for smooth functions, and with the proper choice of regularization parameter, the kernel method adapts to it.
We can now optimize over all weights, output \(a\) and input \(b\). Four natural questions come to mind:
Overparameterization limit (feature learning regime). When optimizing over both sets of weights, we can first note that the prediction function is invariant by the change of variable $$ a_j \leftarrow \mu_j a_j \ , \ \ b_j \leftarrow \mu_j^{-1} b_j,$$ for any \(\mu_j > 0\). Then by optimizing over \(\mu_j\), which is only involved in the penalty term \(\frac{\lambda}{2} \big( a_j^2 + \|b_j\|_2^2 \big)\), we get \((\mu_j^\ast)^2 = \| b_j \|_2 / |a_j|\), and the penalty \(\lambda |a_j| \| b_j\|_2\). We thus get the equivalent optimization problem of minimizing $$\tilde{R}(a,b) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i \, - \sum_{j=1}^m a_j ( b_j^\top x_i )_+ \Big) ^2 + {\lambda} \sum_{j=1}^m |a_j| \| b_j\|_2 .$$ By restricting the \(b_j\) to the unit sphere (which is OK because by optimizing over \(\mu_j\) we become scale-invariant), we can write $$\sum_{j=1}^m a_j ( b_j^\top x_i )_+ = \int_{\mathbb{S}^{d-1}} (b^\top x_i)_+ d\mu(b), $$ for $$ \mu = \sum_{j=1}^m a_j \delta_{b_j}$$ a weighted sum of Diracs, and \(\sum_{j=1}^m |a_j| \| b_j\|_2 = \int_{\mathbb{S}^{d-1}} \! | d\mu(b)|\) the total variation of \(\mu\). Thus, letting \(m\) go to infinity, the finite sums become integrals of general measures, and we end up considering the set of functions that can be written as (see [10] and the many references therein for details): $$f(x) = \int_{\mathbb{S}^{d-1}} ( b^\top x)_+ d \mu(b),$$ with the penalty $$\lambda \int_{\mathbb{S}^{d-1}} \! | d\mu(b)|.$$ It has an \(\ell_1\)-norm flavor (as opposed to the \(\ell_2\)-norm for kernels), and leads to further adaptivity, if optimization problems are resolved (which is itself not easy in polynomial time, see below).
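The rescaling argument can be verified on a single neuron. After the change of variable with \(\mu > 0\), the penalty is \(\frac{\lambda}{2} \big( \mu^2 a_j^2 + \mu^{-2} \|b_j\|_2^2 \big)\); setting its derivative in \(\mu\) to zero gives \(\mu^2 = \|b_j\|_2 / |a_j|\) and the value \(\lambda |a_j| \|b_j\|_2\). The sketch below checks this numerically on arbitrary illustrative values; the helper names `neuron` and `scaled_penalty` are hypothetical.

```python
import numpy as np


def neuron(a_j, b_j, x):
    """Contribution a_j (b_j^T x)_+ of one hidden neuron."""
    return a_j * max(float(b_j @ x), 0.0)


def scaled_penalty(mu, a_j, b_j, lam):
    """Penalty (lam/2) ((mu a_j)^2 + ||b_j / mu||^2) after rescaling by mu > 0."""
    return 0.5 * lam * ((mu * a_j) ** 2 + float(np.sum((b_j / mu) ** 2)))


# Arbitrary values chosen for the illustration.
a_j, b_j, lam = -0.5, np.array([3.0, 4.0]), 0.1
x = np.array([1.0, 2.0])

# Stationary point of the penalty in mu: mu^2 = ||b_j|| / |a_j|.
mu_star = np.sqrt(np.linalg.norm(b_j) / abs(a_j))

# Brute-force search over a grid of positive scalings, as a sanity check.
grid = np.linspace(0.1, 10.0, 2000)
best_on_grid = min(scaled_penalty(mu, a_j, b_j, lam) for mu in grid)

if __name__ == "__main__":
    print("prediction before / after rescaling:",
          neuron(a_j, b_j, x), neuron(mu_star * a_j, b_j / mu_star, x))
    print("penalty at mu_star:   ", scaled_penalty(mu_star, a_j, b_j, lam))
    print("lam * |a_j| * ||b_j||:", lam * abs(a_j) * np.linalg.norm(b_j))
    print("best penalty on grid: ", best_on_grid)
```

The prediction is unchanged by the rescaling because the ReLU is positively homogeneous, which is why only the penalty term is affected.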
The minimal number of neurons to achieve the limiting behavior in terms of predictive performance (assuming optimization problems are resolved) can also be characterized, and grows exponentially in dimension without strong assumptions like the ones made in this blog post (see [10]). We will now study the adaptivity properties of using the function space above.
Adaptive to low-dimensional support and smoothness. It turns out that the adaptivity properties of kernel methods are preserved, both with respect to the support and the smoothness of the target function (see, e.g., [17, 10]). However, neural networks can do better!
Adaptive to linear substructures for one hidden layer. Given that the training of neural networks involves finding the best possible input weights, which are involved in the first linear layer, it is no surprise that we obtain adaptivity to linear latent variables. That is, if we can write \(f^\ast(x) = g(c_1^\top x, \dots, c_r^\top x)\) for some function \(g: \mathbb{R}^r \to \mathbb{R}\), then the vectors \(c_j\)’s will be essentially estimated among the \(b_j\)’s by the optimization algorithm. Therefore, when using the \(\ell_1\)-based function space, the convergence rate in the excess risk will depend on \(r\) and not \(d\) in the exponent [10, 13]. In particular, neural networks can perform non-linear variable selection (while the Lasso only performs linear variable selection). See a nice experiment in [13] for binary classification.
One could imagine that with more hidden layers, this extends to non-linear smooth projections of the data, that is, to cases where we assume that \(f^\ast(x) = g(h(x))\) where \(h: \mathbb{R}^d \to \mathbb{R}^r\) is a smooth function.
Thus, we obtain a stronger adaptivity for infinite \(m\) and the good choice of regularization parameter \(\lambda\). We could then try to reproduce for neural networks the figures obtained for kernel methods with varying \(\lambda\). Unfortunately, this is where non-convex optimization will make everything harder.
Global convergence of gradient flow for infinite width. The traditional algorithm to minimize the empirical risk is gradient descent and its stochastic extensions. In this blog post, we consider gradient descent with small step-sizes, which can be approximated by a gradient flow (as explained in last June blog post). All parameters are randomly initialized, and we consider \(b_j\) uniformly distributed on the sphere of radius \(1/\sqrt{m}\), and \(a_j\) uniformly distributed in \(\{-1/\sqrt{m},1/\sqrt{m}\}\) (this is essentially equivalent to the traditional “Glorot” initialization [18]).
This corresponds to the “mean-field” scaling of initialization where neurons from both layers move (see more details in last July blog post for other scalings). As shown in a joint work with Lénaïc Chizat [19] and explained in last June blog post, when \(m\) tends to infinity, then the gradient flow can only converge to the global minimum of the objective function, which is a non-trivial result because the cost function is not convex in \((a,b)\).
Apparent good properties for small \(m\). A key property which is not yet well understood is that the global convergence behavior can be observed for \(m\) relatively small. For example, considering the one-dimensional regression example below, where the target function is the combination of 5 hidden neurons, with \(m=32\) hidden neurons, it is already possible to learn a good function with high probability (this would not be the case with \(m=5\)).
One extra benefit here is that no regularization (e.g., no penalty) seems needed to obtain this good behavior. Therefore it seems that we get the best of both worlds: reasonably small \(m\) (so not too costly algorithmically), and no need to regularize.
However, like in any situation where overfitting is not observed, there is the possibility of underfitting. To study this, let us consider a noiseless problem where \(y\) is a deterministic function of \(x\), and with a slightly more complicated target function, which now requires 9 hidden neurons.
Even with \(m = 8192\) hidden neurons, the resulting function is not able to fit the data, which is one extreme form of simplicity bias [21]. There is thus strong underfitting with a single hidden layer (the situation seems to be different with deeper networks). How can this be alleviated?
A simple way to avoid the underfitting phenomenon above (but potentially get overfitting, see below) is to minimize the cost \(R(a,b)\) by leveraging a convex sub-problem, as done in the glorious days of kernel methods.
In the cost function $$R(a,b) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^m a_j ( b_j^\top x_i )_+ \Big) ^2 + \frac{\lambda}{2} \sum_{j=1}^m \Big\{ a_j^2 + \| b_j\|_2^2 \Big\},$$ the problem is convex with respect to all \((a_j)\)’s; moreover, it is a least-squares problem which can be solved in closed form by matrix inversion (with algorithms that are much more robust to ill-conditioning than gradient descent).
Then, if we denote by \(\Phi(b) \in \mathbb{R}^{n \times m}\) the matrix with elements \((b_j^\top x_i)_+\), the cost function can be written as a function of \(a \in \mathbb{R}^m\) as $$R(a,b) = \frac{1}{2n} \| y - \Phi(b) a \|_2^2 +\frac{\lambda}{2}\| a\|_2^2 + \frac{\lambda}{2} \sum_{j=1}^m \| b_j\|_2^2.$$ It can be minimized in closed form as $$ a = ( \Phi(b)^\top \Phi(b) + n \lambda I)^{-1} \Phi(b)^\top y = \Phi(b)^\top ( \Phi(b)\Phi(b)^\top + n \lambda I)^{-1} y,$$ where the matrix inversion lemma has been used. This leads to: $$S(b) = \inf_{a \in \mathbb{R}^m} R(a,b) = \frac{\lambda}{2} y^\top ( \Phi(b)\Phi(b)^\top + n \lambda I)^{-1} y + \frac{\lambda}{2} \sum_{j=1}^m \| b_j\|_2^2.$$ The matrix \(\Phi(b)\Phi(b)^\top \in \mathbb{R}^{n \times n}\) is the traditional kernel matrix of the associated non-linear features. The resulting optimization problem is exactly the one used in kernel learning [20].
Computing \(S(b)\) and its gradient with respect to \(b\) can be done in time \(O(n^2 m)\) when \(m\) is large. Since, given the hidden neurons, we get the globally optimal output weights, we avoid the underfitting problem as well as the ill-conditioning issues.
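The closed-form expressions above can be checked numerically. The following is a minimal NumPy sketch, not the code used for the experiments; the function names (`features`, `best_output_weights`, `R`, `S`) are hypothetical, and the data in the usage lines are random placeholders.

```python
import numpy as np

def features(b, X):
    """Phi(b): n x m matrix with entries (b_j^T x_i)_+."""
    return np.maximum(X @ b.T, 0.0)

def R(a, b, X, y, lam):
    """Full regularized cost R(a, b)."""
    n = X.shape[0]
    resid = y - features(b, X) @ a
    return resid @ resid / (2 * n) + 0.5 * lam * (a @ a + np.sum(b ** 2))

def best_output_weights(b, X, y, lam):
    """a = Phi^T (Phi Phi^T + n lam I)^{-1} y, using the n x n kernel matrix."""
    n = X.shape[0]
    Phi = features(b, X)
    K = Phi @ Phi.T
    return Phi.T @ np.linalg.solve(K + n * lam * np.eye(n), y)

def S(b, X, y, lam):
    """S(b) = inf_a R(a, b), computed from the kernel matrix only."""
    n = X.shape[0]
    Phi = features(b, X)
    K = Phi @ Phi.T
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return 0.5 * lam * (y @ alpha) + 0.5 * lam * np.sum(b ** 2)

rng = np.random.default_rng(0)
n, d, m, lam = 20, 3, 50, 0.1
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
b = rng.standard_normal((m, d))
a_star = best_output_weights(b, X, y, lam)
print(R(a_star, b, X, y, lam) - S(b, X, y, lam))  # zero up to rounding error
```

Forming the \(n \times n\) kernel matrix costs \(O(n^2 m)\), which dominates when \(m\) is large; the sketch uses `np.linalg.solve` rather than an explicit inverse for numerical stability.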
When \(m\) tends to infinity, then the optimization problem fits our analysis in [19], that is, we can see the limiting flow as a Wasserstein gradient flow that can only converge to the global optimum of the associated cost function although the overall problem is not convex (as opposed to what happens in multiple kernel learning described in this older post). While there is currently no further quantitative theoretical evidence that a good optimization behavior can be reached for a smaller \(m\), it empirically solves the underfitting issue above, as shown below.
We can now perform experiments with varying \(\lambda\)’s.
Experiments. The two plots below depict the learnt functions for a smooth and non-smooth objective as \(\lambda\) is varied, highlighting the importance of regularization.
We can now compare the convergence rates for the excess risk (expected squared distance between \(f\) and \(f^\ast\)). We can see that the rates are better for smooth functions, and with the proper choice of regularization parameter, neural networks adapt to it. For adaptivity to linear latent variables, see [13].
As I tried to show in this blog post, adaptivity is a key driver of good predictive performance of learning methods. It turns out that for the three classes of methods I considered, the more complex algorithms led to more adaptivity: no optimization for local averaging, finite-dimensional convex optimization for kernel methods and then non-convex optimization for neural networks (or equivalently infinite-dimensional convex optimization).
While neural networks led to more adaptivity, they are still not totally well understood, in particular in terms of the various biases that their training implies, both for shallow and deep networks. This leaves many research directions to follow!
Acknowledgements. I would like to thank Lénaïc Chizat and Lawrence Stewart for proofreading this blog post and making good clarifying suggestions.
[1] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 1996.
[2] László Györfi, Michael Kohler, Adam Krzyżak, Harro Walk. A Distribution-free Theory of Nonparametric Regression. New York : Springer, 2002.
[3] Samory Kpotufe. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, 2011.
[4] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[5] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[6] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2001.
[7] Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
[8] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2008.
[9] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, 2009.
[10] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
[11] Andreas Christmann, Ingo Steinwart. Support Vector Machines. Springer, 2008.
[12] Thomas Hamm, Ingo Steinwart. Adaptive Learning Rates for Support Vector Machines Working on Data with Low Intrinsic Dimension. Technical report, arXiv 2003.06202, 2021.
[13] Lénaïc Chizat, Francis Bach. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. Proceedings of the Conference on Learning Theory (COLT), 2020.
[14] Francis Bach. On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. Journal of Machine Learning Research, 18(19):1-38, 2017.
[15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
[16] Alessandro Rudi, Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, 2017.
[17] Ryumei Nakada, Masaaki Imaizumi. Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality. Journal of Machine Learning Research, 21(174):1−38, 2020.
[18] Xavier Glorot, Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010.
[19] Lénaïc Chizat, Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. Advances in Neural Information Processing Systems, 2018.
[20] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131-159, 2002.
[21] Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, Praneeth Netrapalli. The Pitfalls of Simplicity Bias in Neural Networks. In Advances in Neural Information Processing Systems, 2020.
We present a guest post on Benny Chor's life and works by Oded Goldreich. The post has many refs to papers. They will appear at the end with pointers to them.
Benny Chor was born on December 23rd 1956 and grew up in Tel-Aviv, Israel. He studied Mathematics at the Hebrew University, receiving a B.Sc. in 1980 and an M.Sc. in 1981. He then switched to studying Computer Science at MIT, and graduated in 1985 with a PhD thesis titled
Two Issues in Public Key Cryptography -- RSA Bit Security and a New Knapsack Type System
which received an ACM Distinguished Dissertation award. After post-doctoral periods at MIT and Harvard, he took a faculty position at the Computer Science Department of the Technion (1987-2001), and then at Tel-Aviv University, where he served as the chairman of the department for two years. He died on June 10th 2021, from a terminal disease.
Although Benny was a very articulate and verbal person, I find it impossible to describe his personality in words. The point is that words cannot capture the experience of interacting with him, which was always sheer fun. Still, I guess I should say something personal as a close friend of his. So I will just mention that he lived happily with Metsada, whom he met in the summer of 1981, and that they lovingly raised three kids: Arnon, Omer and Aya. Actually, let me also mention his life-long close relationship with his brother, Kobi.
Focusing on his contributions to science, I will confine myself to the areas of cryptography and randomized computation. Still, let me mention that in the mid 1990s, Benny's research interests gradually shifted to computational biology, but I will not review his contributions to that area, since it is very remote from my own expertise.
Oded and Benny
Benny with his family's namesake (Chor = Bull)
Authors: Kathryn Gray, Mingwei Li, Reyan Ahmed, Stephen Kobourov
Download: PDF
Abstract: Evolving trees arise in many real-life scenarios from computer file systems
and dynamic call graphs, to fake news propagation and disease spread. Most
layout algorithms for static trees, however, do not work well in an evolving
setting (e.g., they are not designed to be stable between time steps). Dynamic
graph layout algorithms are better suited to this task, although they often
introduce unnecessary edge crossings. With this in mind we propose two methods
for visualizing evolving trees that guarantee no edge crossings, while
optimizing (1) desired edge length realization, (2) layout compactness, and (3)
stability. We evaluate the two new methods, along with four prior approaches
(two static and two dynamic), on real-world datasets using quantitative
metrics: stress, desired edge length realization, layout compactness,
stability, and running time. The new methods are fully functional and available
on github.
Authors: Sebastian Schlag, Tobias Heuer, Lars Gottesbüren, Yaroslav Akhremtsev, Christian Schulz, Peter Sanders
Download: PDF
Abstract: This paper considers the balanced hypergraph partitioning problem, which asks
for partitioning the vertices into $k$ disjoint blocks of bounded size while
minimizing an objective function over the hyperedges. Here, we consider the
most commonly used connectivity metric. We describe our open source hypergraph
partitioner KaHyPar which is based on the successful multi-level approach --
driving it to the extreme of one level for (almost) every vertex. Using
carefully designed data structures and dynamic update techniques, this approach
offers a very good time-quality tradeoff. We present two preprocessing
techniques -- pin sparsification using locality sensitive hashing and community
detection based on the Louvain algorithm. The community structure is used to
guide the coarsening process that incrementally contracts vertices.
Portfolio-based partitioning of the contracted hypergraph already achieves good
initial solutions. While reversing the contractions, a combination of
highly-localized direct $k$-way local search and flow-based techniques that
take a more global view, refine the partition to achieve high quality.
Optionally, a memetic algorithm evolves a pool of solution candidates to obtain
even higher quality.
We evaluate KaHyPar on a large set of instances from a wide range of application domains. With respect to quality, KaHyPar outperforms all previously considered systems that can handle large hypergraphs such as hMETIS, PaToH, Mondriaan, or Zoltan. KaHyPar is also faster than most of these systems except for PaToH which represents a different speed-quality tradeoff. The results even extend to the special case of graph partitioning, where specialized systems such as KaHIP should have an advantage.
Authors: David Garcia-Soriano, Francesco Bonchi
Download: PDF
Abstract: We study a novel problem of fairness in ranking aimed at minimizing the
amount of individual unfairness introduced when enforcing group-fairness
constraints. Our proposal is rooted in the distributional maxmin fairness
theory, which uses randomization to maximize the expected satisfaction of the
worst-off individuals. We devise an exact polynomial-time algorithm to find
maxmin-fair distributions of general search problems (including, but not
limited to, ranking), and show that our algorithm can produce rankings which,
while satisfying the given group-fairness constraints, ensure that the maximum
possible value is brought to individuals.
Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian
Download: PDF
Abstract: We study the problem of list-decodable mean estimation, where an adversary
can corrupt a majority of the dataset. Specifically, we are given a set $T$ of
$n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that
an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a
well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction
of the points are arbitrary. The goal is to output a small list of vectors at
least one of which is close to the mean of $\mathcal{D}$. As our main
contribution, we develop new algorithms for list-decodable mean estimation,
achieving nearly-optimal statistical guarantees, with running time $n^{1 +
o(1)} d$. All prior algorithms for this problem had additional polynomial
factors in $\frac 1 \alpha$. As a corollary, we obtain the first almost-linear
time algorithms for clustering mixtures of $k$ separated well-behaved
distributions, nearly-matching the statistical guarantees of spectral methods.
Prior clustering algorithms inherently relied on an application of $k$-PCA,
thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime
improvement for this basic statistical problem in nearly two decades.
The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections -- thus, bypassing the $k$-PCA subroutines required by prior algorithms.
Authors: Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski
Download: PDF
Abstract: Correlation clustering is a central topic in unsupervised learning, with many
applications in ML and data mining. In correlation clustering, one receives as
input a signed graph and the goal is to partition it to minimize the number of
disagreements. In this work we propose a massively parallel computation (MPC)
algorithm for this problem that is considerably faster than prior work. In
particular, our algorithm uses machines with memory sublinear in the number of
nodes in the graph and returns a constant approximation while running only for
a constant number of rounds. To the best of our knowledge, our algorithm is the
first that can provably approximate a clustering problem on graphs using only a
constant number of MPC rounds in the sublinear memory regime. We complement our
analysis with an experimental analysis of our techniques.
Authors: Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
Download: PDF
Abstract: We consider the problem of estimating the number of distinct elements in a
large data set (or, equivalently, the support size of the distribution induced
by the data set) from a random sample of its elements. The problem occurs in
many applications, including biology, genomics, computer systems and
linguistics. A line of research spanning the last decade resulted in algorithms
that estimate the support up to $ \pm \varepsilon n$ from a sample of size
$O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size.
Unfortunately, this bound is known to be tight, limiting further improvements
to the complexity of this problem. In this paper we consider estimation
algorithms augmented with a machine-learning-based predictor that, given any
element, returns an estimation of its frequency. We show that if the predictor
is correct up to a constant approximation factor, then the sample complexity
can be reduced significantly, to \[ \ \log (1/\varepsilon) \cdot
n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on
a collection of data sets, using the neural-network based estimators from {Hsu
et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to
3x) improvements in the estimation accuracy compared to the state of the art
algorithm.
Authors: Ankur Moitra, Elchanan Mossel, Colin Sandon
Download: PDF
Abstract: In this work, we study the computational complexity of determining whether a
machine learning model that perfectly fits the training data will generalize
to unseen data. In particular, we study the power of a malicious agent whose
goal is to construct a model g that fits its training data and nothing else,
but is indistinguishable from an accurate model f. We say that g strongly
spoofs f if no polynomial-time algorithm can tell them apart. If instead we
restrict to algorithms that run in $n^c$ time for some fixed $c$, we say that g
c-weakly spoofs f. Our main results are
1. Under cryptographic assumptions, strong spoofing is possible, and
2. For any c > 0, c-weak spoofing is possible unconditionally.
While the assumption of a malicious agent is an extreme scenario (hopefully companies training large models are not malicious), we believe that it sheds light on the inherent difficulties of blindly trusting large proprietary models or data.
What is a natural question? (\(\mathbb{M}\)) Gasarch on distinguishing notions of interestingness of mathematical problems based on ability to answer them, versus whether they lead to deeper mathematics.
Bad geometry in an illustration of parallel parking advice to change steering lock-to-lock when you reach a 45° angle (\(\mathbb{M}\)): this car is going to end up halfway over the curb. Really the inflection angle depends on the distance from curb and your car’s turning radius. You should switch direction when the rear axle midpoint is halfway between its starting and ending lines of motion, or approximately when the passenger rear wheel is a bit less than halfway to the curb. This car went past that point and can’t straighten before reaching the curb.
How smooth can Nash–Kuiper embeddings be? (\(\mathbb{M}\)) Recent progress identifies a strict threshold at \(C^{1,1/2}\). A new popularized and high-level survey on Quanta. See Wikipedia for more on the Nash embedding theorem (here, in the form that all 2d surfaces have non-smooth embeddings in 3d preserving distances along surface curves) and on the notions of fractional smoothness considered here, left unexplained by the Quanta article.
Interesting progress report by Peter Scholze on the formalized and computer-verified proof of some difficult material in analytic geometry (\(\mathbb{M}\)).
A mathematical foundation for foundation paper pieceable quilts (\(\mathbb{M}\), via). They formalize things by hypergraphs, but I prefer antimatroids: a pattern is pieceable if and only if you can slice off one piece at a time by guillotine cuts, eventually slicing everything. The possible slicing orders form an antimatroid, because a piece, once sliceable, remains so until it is sliced. The piecing order is the reverse of a slicing order.
Firefox’s “reader mode”, in which you can click on the little page-with-text icon next to the url and get a view of the text stripped of extraneous framing, is useful (\(\mathbb{M}\)), but it would be more useful if it actually worked for pages with mathematical content.
One of the footpaths through my neighborhood is next to a small field with two chess tables (\(\mathbb{M}\)). They’ve been overgrown and largely unused for quite a while, but a few months back the neighborhood association’s gardeners cleared the field and planted some ginkgo trees around the tables. Someone has been playing chess there with pine cones:
Scott Aaronson, in Quanta, on why the glib popularizations of quantum computing typically found in publications like Quanta are usually wrong (\(\mathbb{M}\), see also).
The amazing, autotuning sandpile (\(\mathbb{M}\)). Jordan Ellenberg on abelian sandpile models, with pretty pictures of the fractal patterns they generate, from 2015. These patterns are still a topic of active research; see e.g. “The Apollonian structure of integer superharmonic matrices”, Ann. Math. 2017, and “Harmonic dynamics of the abelian sandpile”, PNAS 2019.
Dodecahedral trajectory (\(\mathbb{M}\)). Mathematical glass artist Bronna Butler visualizes closed geodesics from and returning to a single vertex on the dodecahedron, using stained glass. These paths exist only on the dodecahedron, among the Platonic solids: geodesics starting from a vertex on the other solids will either hit another vertex before returning, or never return.
Automatic face-recognition systems used to auto-crop social media photos discovered to be racist and sexist (\(\mathbb{M}\)). This sort of thing is part of a long line of stories of computer systems learning to replicate the biases of their developers, and one of many reasons we need greater diversity in CS.
Introduction to Computational Origami (\(\mathbb{M}\)), Ryuhei Uehara’s 2020 book, is the subject of a new featured review by Tom Hull on MAA Reviews.
Howell’s moving orbits (\(\mathbb{M}\)). Greg Egan explains the early research of Kathleen Howell on the existence of first-order-stable halo orbits of a third small body under the gravitational influence of two larger bodies, related to the (unstable) L1 and L2 Lagrange points.
The 2021 Bridges Conference Mathematical Art Galleries (\(\mathbb{M}\)) are online even though the online conference itself is not until August. I haven’t yet had time to explore all of them but there are lots of good pieces in there.
There are two ways to write error-free programs; only the third one works. – Alan Perlis
Perlis on Coding Joy source
Alan Perlis, the first Turing Award winner, summarized the whole issue of program correctness in this single quote. Maybe there is nothing more to say about it.
But this coming Thursday, if you want to hear more about it, you can tune in for a debate with Rich DeMillo and myself. It will be 7:00–8:30pm Eastern time.
Harry Lewis of Harvard University will moderate. It will be broadcast LIVE HERE. It’s free.
Perlis and Rich DeMillo and I wrote the paper “Social Processes and Proofs of Theorems and Programs” in the middle of the second half of the last century, pushing towards fifty years ago. We were responding to the then-common view that programs should be proved correct, in the same manner that one proves theorems like:
Theorem 1 (Euclid) There are an infinite number of prime numbers.
We begged to object: We felt that correctness of programs was fundamentally a different issue. This is what the debate is all about.
The debate will give us all a chance to reflect now almost 50 years later when programs enrich and regulate so much more of our lives. Perhaps the real winner in society goes according to the proverb: “the one who ends up with the most toys.” In that case, Harry the moderator will win hands down:
Mary-Claire van Leunen is an expert on writing, especially for technical articles. See her famous book.
The article was heavily written by Rich with input from Alan and myself. The ideas are due to all of us. The actual details, the words, the punctuation owe more to Rich with strong input from Mary-Claire.
Our paper is one of forty-six papers included in the book Ideas That Created the Future: Classic Papers of Computer Science, which was edited by Harry for MIT Press (2020). They are presented in chronological order beginning with Aristotle (~350 BCE). The median year is 1962-63.
See How to Have Your Abstract Rejected for a tongue-in-cheek view of writing technical material. It was written by Mary-Claire and myself.
See you Thursday.
[H/T Jelani Nelson]
In recent years there has been increasing interest in using machine
learning to improve the performance of classical algorithms in
computer science, by fine-tuning their behavior to adapt to the
properties of the input distribution. This “data-driven” or
“learning-based” approach to algorithm design has the potential to
significantly improve the efficiency of some of the most widely used
algorithms. For example, it has been used to design better data
structures, online algorithms, streaming and sketching algorithms,
market mechanisms and algorithms for combinatorial optimization,
similarity search and inverse problems. This virtual workshop will
feature talks from experts at the forefront of this exciting area.
The workshop will take place virtually on July 13-14, 2021.
Registration is free but mandatory. Link to register: https://fodsi.us/ml4a.html
Confirmed Speakers:
Organizers:
(h/t Salil Vadhan)
The Journal of the ACM is looking for a new editor in chief: see the call for nominations. The (soft) deadline to submit nominations (including self nominations) is July 19th and you can do so by emailing Chris Hankin at c.hankin@imperial.ac.uk
A call for nominations for her successor
In her original orange
Éva Tardos’s sentence as Editor-in-Chief (EiC) of the Journal of the ACM (JACM) is coming to an end. She has served almost six years. She will be paroled soon for her good behavior.
Today we announce that a grand jury is being convened. This group is charged with finding the right person to be indicted as the next editor-in-chief of the JACM.
Tips on possible suspects should be directed to lieutenant Chris Hankin here. Note that the location is not far from Scotland Yard. He is in charge of the grand jury:
Laura Haas (University of Massachusetts Amherst, US)
Orna Kupferman (Hebrew University, Israel)
Marta Kwiatkowska (University of Oxford, UK)
Brad Myers (Carnegie Mellon University, US)
Salil Vadhan (Harvard University, US)
Albert Zomaya (University of Sydney, Australia)
Divesh Srivastava (AT&T, US), ACM Pubs Board Liaison
The JACM was established in 1954 and has published over 3,000 papers since then. There were 196 submissions received in 2020 alone. The current editorial board is this. Previous editors-in-chief are:
1954 – 1958: Franz Alt
1959 – 1962: Mario Juncosa
1963 – 1965: Richard Hamming
1966 – 1968: Calvin Gotlieb
1969 – 1972: Gerard Salton
1973 – 1975: Raymond Miller
1976 – 1979: Edward Coffman, Jr.
1979 – 1982: Michael Garey
1983 – 1986: Michael Fischer
1986 – 1990: Daniel Rosenkrantz
1991 – 1997: Tom Leighton
1997 – 2003: Joseph Halpern
2003 – 2009: Prabhakar Raghavan
2009 – 2015: Victor Vianu
2015 – currently: Éva Tardos
Typically many candidates are available, but only the most dramatic cases will be considered. Nominations should include a vita along with a brief statement of why the nominee should be considered. Why are they un-worthy to be made into the editor-in-chief? A vision statement is highly encouraged. The relevant law is found here:
Roles and Responsibilities in ACM Publishing
ACM’s Evaluation Criteria for Editors-in-Chief
All kidding aside, we thank Éva for her hard work keeping the JACM one of the top journals in the world. We can only hope that the next chief will be nearly as successful as she has been.
You may have heard of the Zeckendorf representation according to which any positive integer can be represented as a sum of non-consecutive Fibonacci numbers. Its uses include the optimal strategy in the game of Fibonacci nim. But did you know that it’s possible to efficiently add and subtract Zeckendorf representations?^{1} The algorithm from the paper linked above takes three passes over the input digit sequences using finite state automata, much like binary number addition can be performed by a single pass of a finite state automaton. I thought it might be interesting to describe an alternative path to the same result, using chip-firing games.
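For readers who have not seen Zeckendorf representations before, here is a minimal Python sketch of the standard greedy construction (repeatedly take the largest Fibonacci number that still fits). This is not the addition algorithm from the paper; the function name `zeckendorf` and the convention of starting the Fibonacci sequence at 1, 2 are choices made for this illustration.

```python
def zeckendorf(n):
    """Return the Zeckendorf representation of a positive integer n
    as a decreasing list of non-consecutive Fibonacci numbers."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Fibonacci numbers 1, 2, 3, 5, 8, ... up to n (1 is listed only once).
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    terms = []
    for f in reversed(fibs):
        if f <= n:          # greedy: take the largest Fibonacci number that fits
            terms.append(f)
            n -= f
    return terms

print(zeckendorf(100))  # [89, 8, 3]
```

The greedy choice never picks two consecutive Fibonacci numbers: after taking \(F_i\), the remainder is less than \(F_{i-1}\), since otherwise \(F_{i+1} = F_i + F_{i-1}\) would have fit.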
Chip-firing games, or sandpile models, are systems described by a graph, with markers of some kind such as coins on its vertices. The graph may possibly be directed, although in the simplest examples it is undirected, and it may be infinite. Each vertex may have arbitrarily many coins on it. But, if a vertex has at least as many coins as it has outgoing edges, we can “fire” the vertex, moving one coin to each neighbor. This repeats until no more such events are possible. Doing this on an infinite square grid with an initial state that piles many coins on a single vertex leads to pretty fractals; here’s a detail of the result of starting with 30 million coins:
However, we’ll be looking at this sort of system on one-dimensional infinite graphs, where their behavior is not quite as complicated.
Chip-firing, on any graph and any placement of coins for which it terminates, forms an antimatroid.^{2} ^{3} A vertex \(v\) can fire for the \(i\)th time, as long as it has already fired \(i-1\) times and a total of \(i\cdot\deg(v)\) coins have reached it. Once these conditions are met, they remain true until \(v\) fires. This idea, that once the item \((v,i)\) becomes available to be added to the sequence of firings it remains available until it is added, is the defining principle of an antimatroid. From it one can prove that, if any firing sequence terminates in a stable configuration, then all sequences terminate, they all fire the same vertices the same numbers of times, and they all end in the same stable configuration.
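To see this order-independence in action, here is a small Python sketch that fires vertices in a random order on a toy graph and compares the outcomes. The graph encoding (a dictionary of out-neighbor lists, with sinks having an empty list), the function name `stabilize`, and the example graph are all assumptions made for this illustration.

```python
import random

def stabilize(graph, coins, rng):
    """Fire ready vertices in random order until none can fire.

    graph: dict mapping each vertex to its list of out-neighbors;
           a vertex with no out-neighbors is a sink and never fires.
    coins: dict mapping vertices to their initial number of coins.
    Returns (final coin counts, number of times each vertex fired).
    """
    coins = {v: coins.get(v, 0) for v in graph}
    fired = {v: 0 for v in graph}
    while True:
        ready = [v for v, out in graph.items() if out and coins[v] >= len(out)]
        if not ready:
            return coins, fired
        v = rng.choice(ready)
        coins[v] -= len(graph[v])
        for w in graph[v]:
            coins[w] += 1       # one coin along each outgoing edge
        fired[v] += 1

# A path 0 - 1 - 2 - 3 - 4 whose endpoints are sinks, with 6 coins in the middle.
path = {0: [], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: []}
results = [stabilize(path, {2: 6}, random.Random(seed)) for seed in range(20)]
print(all(r == results[0] for r in results))  # the firing order does not matter
```

The sinks guarantee termination on this example: every coin either comes to rest or eventually falls off one end of the path.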
A binary representation of a number \(x\) is just a set of distinct powers of two, summing to \(x\). If you add two binary numbers \(x\) and \(y\), you can combine their sets into a single multiset; carrying can be thought of as a systematic method of getting rid of the duplicate powers of two in this multiset. Whenever you have two equal powers of two in a multiset, whose sum you are trying to represent, you can merge them with the fusion rule
\[2^i+2^i \Rightarrow 2^{i+1}.\]
This can be thought of as a chip-firing game on a graph where there is a vertex for every power of two, an edge to the next larger power of two, and an edge to a “bit bucket” vertex with no outgoing edges (which cannot be fired). Each instantiation of the fusion rule fires vertex \(2^i\), moving one coin to \(2^{i+1}\) and one to the bit bucket.
This is exactly what we are doing when we carry a term in binary addition: taking two powers of two from a column of the addition problem and fusing them into a single power of two in the next column. We can also think of this physically, using a one-dimensional array of cells with poker chips or coins on them, with a coin on cell \(i\) representing the number \(2^i\). The fusion rule takes two coins from any cell and replaces them by a single coin in the next cell. In binary addition of pairs of numbers, there are at most three coins per cell (the two that started out there and one carry), but you could use the same fusion rule for addition of more than two numbers, using an array of cells that can each contain a large pile of coins.
Carrying the coin analogy further, we might imagine that the coins have values that are powers of two, and that we are making change by replacing pairs of small coins by a single larger coin of equal value.
Conventional binary arithmetic does these fusion steps in a systematic order, from low-order bits of the binary representation (smaller powers of two) to higher-order bits (larger powers of two). It’s that systematic order, together with the observation that each pile has at most three coins, that makes this method suitable for a finite state machine. But actually, you could apply the fusion rule in any order, and it would work equally well. Each step reduces the number of coins by one, so you can never do more steps than your starting number of coins. We can only halt when we reach the binary representation of the sum, which is uniquely determined, so this process is confluent: every sequence of choices leads to the same eventual outcome. In this case, confluence can also be seen from the antimatroid property of chip-firing games.
This observation about reducing the number of powers of two gives an immediate proof of a fact about binary representations that you may not have known: the binary representation of \(x+y\) has at most as many nonzero bits as \(x\) and \(y\) have in total.
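To see the order-independence concretely, here is a minimal Python sketch of the fusion rule applied to piles chosen at random. The names `fuse_binary` and `bits` are mine, not from any library, and the random choice of pile is only there to illustrate that any order gives the same answer.

```python
import random
from collections import Counter

def bits(x):
    """Exponents of the powers of two in the binary representation of x."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def fuse_binary(exponents, rng=None):
    """Apply 2^i + 2^i => 2^(i+1) in an arbitrary order until no pile has two coins.

    `exponents` is a multiset of exponents, one entry per coin of value 2^i.
    Returns (sorted remaining exponents, number of fusion steps).
    """
    if rng is None:
        rng = random.Random(0)
    piles = Counter(exponents)
    steps = 0
    while True:
        ready = [i for i, c in piles.items() if c >= 2]
        if not ready:
            break
        i = rng.choice(ready)   # any choice works: the process is confluent
        piles[i] -= 2
        piles[i + 1] += 1
        steps += 1
    return sorted(i for i, c in piles.items() if c == 1), steps

result, steps = fuse_binary(bits(23) + bits(42))
print(result, steps)
```

Each fusion removes one coin, so the step count is always the number of coins you start with minus the number of nonzero bits in the sum, whatever order you pick.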
Now, instead of coins with power-of-two values, let’s suppose that they have Fibonacci numbers as their values, representing any multiset of Fibonacci numbers. We can still combine pairs of Fibonacci numbers by a fusion rule, that now applies when we have two coins in adjacent piles:
\[F_i + F_{i+1} \Rightarrow F_{i+2}.\]But in order to get the Zeckendorf representation of the sum of the multiset, we also need to deal with pairs of coins that have the same value, because repeated copies of the same Fibonacci number are not allowed in the Zeckendorf representation. We can accomplish this by a “fission” rule, that steps backwards according to the fusion rule in order to take a bigger step forwards:
\[F_i+F_i \Rightarrow F_{i-2}+F_{i-1}+F_i\Rightarrow F_{i-2}+F_{i+1}.\]This makes sense even when \(F_i\) is \(1\) or \(2\): for \(F_i=1\), the \(F_{i-2}\) term is zero and can be dropped from the sum, and for \(F_i=2\) it equals one, a Fibonacci number. So as special cases of this rule we have \(1+1\Rightarrow 2\) and \(2+2\Rightarrow 1+3\).
Several issues complicate the analysis of this replacement system. First, although fission by itself is a chip-firing rule on an infinite graph with outdegree two, fission and fusion together make a more complicated system that is not an antimatroid, so different orderings of choosing what to do may lead to different numbers of steps. Second, the fission rule changes the piles of coins both to the left and to the right of the pile to which it applies, making it harder to find a consistent ordering in which to apply these rules. And third, fission preserves the number of coins rather than reducing them, making both the eventual termination of the system and the analysis of how many steps it takes less obvious. But assuming it does terminate, it can only terminate at the Zeckendorf representation, which is unique. So termination implies confluence.
First let’s see what happens if we just use the fission rule, forgetting about the fusion rule. Because fission by itself is a chip-firing rule, any initial state has an invariant number of firings and an invariant final state, regardless of the order of operations. For an initial state with \(n\) copies of \(F_i\), the final state gives us an expansion of \(F_i\) into a sum of \(n\) distinct Fibonacci numbers:
\[F_i = F_i\] \[2\,F_i = F_{i-2}+F_{i+1}\] \[3\,F_i=F_{i-2}+F_i+F_{i+1}\] \[4\,F_i=F_{i-4}+F_{i-3}+F_i+F_{i+2}\] \[\vdots\]The indexes appearing in these identities can be summarized in a triangle of numbers:
\[\begin{array}{ccccccccccccc} &&&&&&0&&&&&&\\ &&&&&-2&&\ 1\ &&&&&\\ &&&&-2&&0&&\ 1\ &&&&\\ &&&-4&&-3&&0&&\ 2\ &&&\\ &&-4&&-3&&-2&&1&&\ 2\ &&\\ &-4&&-3&&-2&&0&&1&&\ 2\ &\\ -6&&-5&&-4&&-3&&-1&&1&&\ 3\ \\ &&&&&&\vdots&&&&&&\\ \end{array}\]The largest number in each row grows only logarithmically with \(n\), because of the way the values of the Fibonacci numbers grow exponentially as a function of their index. The fission rule cannot create gaps of three or more empty cells, so consecutive numbers in a row differ by at most three, implying that the smallest number in each row is greater than \(-3n\). Because of this linear bound on how far fission can move the coins, we get a quadratic bound on the total number of firing events: each firing decreases the total distance of the coins from cell \(-3n\). This total distance starts at \(3n^2\) and remains positive, so there can be at most \(3n^2\) firings. In computational experiments up to \(n=10000\) the actual number of firings never exceeded \(n^2\) and appeared to be growing as \(n^2\bigl(1-o(1)\bigr)\). This is a big contrast from the behavior of the similar-looking chip-firing rule on a line of cells that replaces pairs of coins by a coin one cell to the left and a coin one cell to the right, which (with \(n\) chips starting in a pile on one cell) produces a square pyramidal number of firings, cubic in \(n\).
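The fission-only process is easy to simulate. The sketch below uses a hypothetical helper, `fission_only`, with cells indexed relative to the starting pile. It assumes the starting index is large enough that the special cases near \(F_1\) and \(F_2\) never arise. Under those assumptions it reproduces the rows of the triangle and counts the firings.

```python
from collections import Counter

def fission_only(n):
    """Start with n coins on cell 0 and apply only the fission rule
    F_i + F_i => F_(i-2) + F_(i+1) until every cell holds at most one coin.

    Returns (sorted occupied cells, number of firings). Cell indexes are
    offsets from the starting cell and may be negative.
    """
    piles = Counter({0: n})
    firings = 0
    pending = [0]                 # cells that may hold two or more coins
    while pending:
        i = pending.pop()
        while piles[i] >= 2:
            piles[i] -= 2
            piles[i - 2] += 1
            piles[i + 1] += 1
            firings += 1
            pending.extend((i - 2, i + 1))
    return sorted(i for i, c in piles.items() if c), firings

for n in range(1, 8):
    print(n, *fission_only(n))
```

Because fission by itself is a chip-firing rule, the order in which `pending` cells are processed should not affect either the final row or the firing count.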
For initial coins that are not all in one big pile, it remains impossible for fission to produce new long gaps, and we can decompose the sequence of cells into subsequences of linear length separated by gaps too long to be crossed. The same analysis of total distance of coins from the starts of sequences of cells shows that, regardless of starting state, and even if we also allow fusion rules, we get an \(O(n^2)\) bound on the total number of steps. In particular, this system always terminates.
Fortunately, we don’t have to take quadratic time to simplify sums of Fibonacci numbers into their Zeckendorf representations using these rules. The problem with the previous analysis was that we were doing too much fission before we did any fusion. Suppose we sidestep that with the following prioritization:
If any fusion steps are possible, there must be at least one (for instance, the one with the largest index) that is prioritized. To analyze these prioritized steps, define \(N\) to be the current number of coins, and \(M\) to be the number of coins that are at or below a pile with two or more coins on it. The key insight is that whenever a step moves a coin to a pile with a larger index, it cannot form a new larger-index multi-coin pile, so both \(M\) and \(N\) are non-increasing. They can stay the same only for a fission step at a pile of three or more coins, but in that case the next step must be a fusion. \(M\) and \(N\) are both \(\le n\), and their sum decreases at least once for every two steps, so the total number of steps is \(\le 4n\). Thus, this prioritized chip-firing method can reduce any sum of Fibonacci numbers to its Zeckendorf representation in a linear number of steps. It’s also not difficult to implement so that it takes linear time.
Anyway, now that we have a way of simplifying sums of Fibonacci numbers to their Zeckendorf form, let’s use it for arithmetic on Zeckendorf representations, without the need to reduce them to simpler forms.
First, addition. This is very easy: Add the two Zeckendorf representations as multisets (producing piles of coins that might be adjacent, or might have two coins in them) and then simplify. The result is a linear-time algorithm for adding two Zeckendorf representations, as was already known. But because it isn’t based on finite state machines, it can also handle more than two coins, for instance by adding multiple numbers at once rather than having to reduce the problem to multiple pairwise additions. As with binary arithmetic, the chip-firing method gives a quick proof of a non-obvious fact: the Zeckendorf representation of \(x+y\) has at most as many nonzero terms as there are in \(x\) and \(y\) put together.
It’s also possible to use this technique (or really, any linear-time Zeckendorf addition algorithm) as part of a long multiplication algorithm for Zeckendorf representations. To multiply two numbers \(x\) and \(y\), both given by their Zeckendorf representations, first perform a sequence of additions to compute the representations of the numbers \(F_i\cdot x\), each computed as \((F_{i-1}\cdot x)+(F_{i-2}\cdot x)\) in a single addition from earlier numbers in the sequence. Then pick out the subset of these representations for which \(F_i\) belongs to the Zeckendorf representation of \(y\), and add them together. (Or interleave the computation of \(F_i\cdot x\) and the sum of these representations, to save space.) The total time for \(\ell\)-bit inputs is \(O(\ell^2)\), as it is for the usual long multiplication algorithm. I’m using the different variable \(\ell\), because the time here is a function of all the bits in the input Zeckendorf representation, rather than just the number of nonzero bits.
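Here is a rough Python sketch of addition and long multiplication along these lines. It assumes the indexing \(F_2=1, F_3=2,\dots\) and represents a number by the sorted list of indexes in its Zeckendorf representation. The `simplify` routine prefers fusion over fission, but it does not implement the exact prioritization or the linear-time bookkeeping, so treat it as an illustration of correctness rather than of running time. All function names are made up for this example.

```python
from collections import Counter

def fib(i):
    a, b = 0, 1                      # F_0, F_1
    for _ in range(i):
        a, b = b, a + b
    return a

def zeckendorf(x):
    """Greedy Zeckendorf representation of x >= 0 as a sorted list of indexes >= 2."""
    out, i = [], 2
    while fib(i + 1) <= x:
        i += 1
    while x > 0:
        if fib(i) <= x:
            out.append(i)
            x -= fib(i)
        i -= 1
    return sorted(out)

def simplify(indices):
    """Reduce a multiset of Fibonacci indexes to Zeckendorf form by fusion and fission."""
    piles = Counter(indices)
    while True:
        occupied = sorted((i for i, c in piles.items() if c > 0), reverse=True)
        # Fusion F_(i-1) + F_i => F_(i+1), trying the largest index first.
        fuse = next((i for i in occupied if piles[i - 1] > 0), None)
        if fuse is not None:
            piles[fuse] -= 1
            piles[fuse - 1] -= 1
            piles[fuse + 1] += 1
            continue
        split = next((i for i in occupied if piles[i] >= 2), None)
        if split is None:            # no duplicates, no adjacent piles: done
            return sorted(i for i, c in piles.items() if c > 0)
        # Fission F_i + F_i => F_(i-2) + F_(i+1), with the two small special cases.
        piles[split] -= 2
        piles[split + 1] += 1
        if split == 3:
            piles[2] += 1            # 2 + 2 => 1 + 3
        elif split > 3:
            piles[split - 2] += 1    # split == 2 is 1 + 1 => 2, nothing left over

def add(a, b):
    return simplify(list(a) + list(b))

def multiply(a, b):
    """Long multiplication: build F_i * x by repeated addition and sum the
    multiples whose index i appears in the representation b of y."""
    if not a or not b:
        return []
    prev, cur = [], list(a)          # F_0 * x = 0 and F_1 * x = x
    total = []
    for i in range(2, max(b) + 1):
        prev, cur = cur, add(cur, prev)   # cur now represents F_i * x
        if i in b:
            total = add(total, cur)
    return total

print(add(zeckendorf(3), zeckendorf(3)), multiply(zeckendorf(6), zeckendorf(7)))
```

The multiplication interleaves the computation of \(F_i\cdot x\) with the running sum, as suggested above, so only two multiples are kept at a time.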
You can convert a binary number to Zeckendorf with the same idea, computing the sequence of powers of two and then picking out and adding together the subset of these powers used in the binary representation of the input. In the other direction, you can convert a Zeckendorf representation to binary by computing the Fibonacci sequence in binary, picking out the terms from the given Zeckendorf representation, and adding them. Both directions of conversion take time \(O(\ell^2)\). A recent paper by Sergeev shows how to do these conversions much more efficiently, only logarithmically slower than binary-number multiplication.^{4} Based on Sergeev’s results it would also be much more efficient (in theory at least, if not necessarily in practice) to multiply Zeckendorf representations by converting to binary, multiplying in binary, and converting back.
As the article I linked at the start hints, the direct algorithms described here are a bit unsatisfactory from the point of view of circuit complexity: the circuit size is ok but the circuit depth is much larger than it would be for binary arithmetic. I’m not convinced that this is a serious deficiency, because we’re not likely to build computers that use this arithmetic at the circuit level. Any algorithms like this are more likely to be used only in specialized situations where Zeckendorf arithmetic is relevant, like the analysis of certain combinatorial games.
1. Connor Ahlbach, Jeremy Usatine, Christiane Frougny, and Nicholas Pippenger (2013), “Efficient algorithms for Zeckendorf arithmetic”, Fibonacci Quarterly 51 (3): 249–255, arXiv:1207.4497, MR3093678.
2. Anders Björner, László Lovász, and Peter Shor (1991), “Chip-firing games on graphs”, European Journal of Combinatorics 12 (4): 283–291, doi:10.1016/S0195-6698(13)80111-4, MR1120415.
3. Kolja Knauer (2009), “Chip-firing, antimatroids, and polyhedra”, EuroComb 2009, Electronic Notes in Discrete Mathematics 34: 9–13, doi:10.1016/j.endm.2009.07.002, MR2591410.
4. I. S. Sergeev (2018), “On the complexity of Fibonacci coding”, Problems of Information Transmission 54: 343–350, doi:10.1134/S0032946018040038, MR3917588.
While in theory bipartite matching should be easy, it has been observed in practice that real instances of matching theoreticians with jobs are hard. It’s been a particularly unusual year with the entire cycle going virtual. Congrats to matched vertices on both sides!
Here is a link to a crowdsourced spreadsheet created to collect information about theory hires this year. I put in a biased pseudorandom seed, please help populate and share! Rules for the spreadsheet have been copied from previous years and all edits to the document are anonymized. Please, feel free to contact me directly or post a comment if you have any suggestions about the rules.
Theory Jobs 2021 was originally published by Grigory Yaroslavtsev at The Big Data Theory on June 12, 2021.
Scribe notes by Junu Lee, Yash Nair, and Richard Xu.
Previous post: Toward a theory of generalization learning Next post: TBD.
See also all seminar posts and course webpage.
lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video
Much of the material on causality is taken from the wonderful book by Hardt and Recht.
For fairness, a central source was the book in preparation by Barocas, Hardt, and Narayanan as well as the related NeurIPS 2017 tutorial, and other papers mentioned below.
We may have heard that “correlation does not imply causation”. How can we mathematically represent this statement, and furthermore differentiate the two rigorously?
Roughly speaking, \(A\) and \(B\) are correlated if \(\Pr[B=b \mid A=a]\) is different for different values of \(a\). To represent causation, we change the second part of the formula: \(A\) causes \(B\) if intervening to change \(A\) to some value \(a\) changes the probability of \(B\). That is,
\(\Pr[B=b \mid \mathrm{do}(A=a)]\) depends on \(a\).
Suppose we have the random variables \(X, W, H\) (taken over the choice of a random person), which represent eXercising, being overWeight, and having Heart disease, respectively. We put forth the following (hypothetical! this is not medical advice!) scenarios for their relationships:
Scenario 1. \(X \sim \mathrm{Bern}(1/2)\). Now \(W\), the overweight indicator, follows the causal relation:
\[W = \begin{cases} 0 & \text{if } X=1 \\ \mathrm{Bern}(1/2) & \text{if } X=0 \end{cases}\]
and the heart disease indicator \(H\) follows the same rule
\[H = \begin{cases} 0 & \text{if } X=1 \\ \mathrm{Bern}(1/2) & \text{if } X=0 \end{cases}\]
So, in this scenario, exercise prevents heart disease and being overweight, while if we don’t exercise, we may be overweight or suffer from heart disease with probability 1/2 independently.
Scenario 2. \(W \sim \mathrm{Bern}(1/4)\),
\[X = \begin{cases} 0 & \text{if } W=1 \\ \mathrm{Bern}(2/3) & \text{if } W=0 \end{cases}\]
and \(H\) still depends on \(X\) by the same rule as in the previous scenario. So, in this scenario, people are naturally prone to being overweight with probability 1/4, and being overweight makes you less likely to exercise, rather than the causal relation being the other way around. As before, exercise prevents heart disease, and someone who did not exercise will get heart disease with probability 1/2.
We find that in scenario 1, \(\Pr[W=1 \mid X=0] = 1/2\). In scenario 2,
\[\Pr[W=1 \mid X=0] = \frac{\Pr[W=1, X=0]}{\Pr[X=0]} = \frac{1/4}{1/2} = \frac{1}{2}.\]
In fact, as this table shows, the probabilities for all combinations of \(X, W, H\) are identical in the two scenarios!
Now, consider the intervention of setting \(X=0\), i.e. stop exercising. That is, we change the generating model for \(X\) to be \(X=0\). In scenario 1, \(\Pr[W=1 \mid \mathrm{do}(X=0)]\) is still \(1/2\). In scenario 2, \(X\) tells us nothing about \(W\) now so we get \(\Pr[W=1 \mid \mathrm{do}(X=0)] = 1/4\). Now that we added in an intervention, the two scenarios are different!
This is an example of why correlations are not causations: while the conditional probabilities are identical in the two scenarios, the causal probabilities are different.
NOTE: Working out this example, and understanding (a) why the two scenarios induce identical probabilities, and in particular all conditional probabilities are identical and (b) why the causal probabilities differ from the conditional probabilities in Scenario 2, is a great way to get intuition for causality and its pitfalls.
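One way to work through the example is to enumerate both scenarios exactly. The sketch below is only an illustration and rests on two assumptions of mine. First, people exercise with probability 1/2 in Scenario 1. Second, in Scenario 2 a non-overweight person exercises with probability 2/3. These values are chosen so that the two joint distributions agree, as claimed above. The function names (`joint1`, `joint2`, `prob`) are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v."""
    return p if v == 1 else 1 - p

def p_h(h, x):
    # Heart disease: never if exercising, Bern(1/2) otherwise (both scenarios).
    return bern(Fr(0) if x == 1 else Fr(1, 2), h)

def joint1(x, w, h, do_x=None):
    # Scenario 1: X -> W and X -> H. `do_x` forces X to a value.
    px = bern(Fr(1, 2), x) if do_x is None else Fr(int(x == do_x))
    pw = bern(Fr(0) if x == 1 else Fr(1, 2), w)
    return px * pw * p_h(h, x)

def joint2(x, w, h, do_x=None):
    # Scenario 2: W -> X and X -> H. Intervening on X cuts the W -> X arrow.
    pw = bern(Fr(1, 4), w)
    if do_x is None:
        px = bern(Fr(0) if w == 1 else Fr(2, 3), x)
    else:
        px = Fr(int(x == do_x))
    return pw * px * p_h(h, x)

def prob(joint, event, given=lambda x, w, h: True, **kw):
    states = list(product((0, 1), repeat=3))
    num = sum(joint(*s, **kw) for s in states if event(*s) and given(*s))
    den = sum(joint(*s, **kw) for s in states if given(*s))
    return num / den

overweight = lambda x, w, h: w == 1
no_exercise = lambda x, w, h: x == 0
for name, joint in (("scenario 1", joint1), ("scenario 2", joint2)):
    print(name,
          "conditional:", prob(joint, overweight, no_exercise),
          "interventional:", prob(joint, overweight, do_x=0))
```

Under these assumptions the conditional probabilities coincide while the interventional ones do not, which is exactly the gap the example is meant to expose.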
Consider Scenario 1, where the causal structure is as follows:
Looking at the table above, we see that the unconditional probability \(\Pr[H=1]\) equals \(1/4\). Since in this scenario, there is no causal relation between being overweight and suffering from heart disease, the causal probability \(\Pr[H=1 \mid \mathrm{do}(W=0)]\) is also equal to \(1/4\).
However, we can calculate the conditional probability from the table and see that \(\Pr[H=1 \mid W=0] = 1/6\).
That means that even though in this scenario, there is no causal relation between being overweight and getting heart disease, conditioning on not being overweight reduces the probability of getting heart disease.
Once again we see here a gap between the conditional and causal probabilities.
The reason for this gap is that there is a confounding variable, namely \(X\), that is a common cause of both \(H\) and \(W\).
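Definition: \(A, B\) are confounded if there are values \(a, b\) such that
\[\Pr[B=b \mid A=a] \neq \Pr[B=b \mid \mathrm{do}(A=a)].\]
To fix the effect of a confounder, we condition on it. It also allows us to find the probability of an intervention. The general deconfounding formula is
\[\Pr[B=b \mid \mathrm{do}(A=a)] = \sum_{z} \Pr[B=b \mid A=a, Z=z]\,\Pr[Z=z]\] (★),
where \(Z\) ranges over all the immediate causes of \(A\).
Contrast this with the formula for computing the conditional probability, which is
\[\Pr[B=b \mid A=a] = \sum_{z} \Pr[B=b \mid A=a, Z=z]\,\Pr[Z=z \mid A=a].\]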
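As a sanity check, the two formulas can be compared on Scenario 1, taking exercise as the only immediate cause of being overweight. This is a sketch under my own assumption that people exercise with probability 1/2 in that scenario; the helper `pr` is hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

# Scenario 1 joint distribution over (x, w, h), assuming X ~ Bern(1/2) and that
# W and H are both 0 when x = 1 and independent Bern(1/2) when x = 0.
joint = {}
for x, w, h in product((0, 1), repeat=3):
    if x == 1:
        joint[x, w, h] = Fr(1, 2) if (w, h) == (0, 0) else Fr(0)
    else:
        joint[x, w, h] = Fr(1, 2) * Fr(1, 4)

def pr(event, given=lambda x, w, h: True):
    den = sum(p for s, p in joint.items() if given(*s))
    num = sum(p for s, p in joint.items() if given(*s) and event(*s))
    return num / den

heart = lambda x, w, h: h == 1

# Conditional probability Pr[H=1 | W=0]: weights Pr[X=x | W=0].
conditional = pr(heart, lambda x, w, h: w == 0)

# Deconfounded probability: sum over x of Pr[H=1 | W=0, X=x] * Pr[X=x].
adjusted = sum(
    pr(heart, lambda x, w, h, x0=x0: w == 0 and x == x0)
    * pr(lambda x, w, h, x0=x0: x == x0)
    for x0 in (0, 1)
)
print(conditional, adjusted)
```

The only difference between the two sums is whether the confounder is weighted by \(\Pr[X=x]\) or by \(\Pr[X=x \mid W=0]\), and that is enough to move the answer from 1/4 to 1/6.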
Using the deconfounding formula (★) requires (a) knowing the causal graph, and (b) observing the confounders. If we get this wrong and control for the wrong confounders we can get the causal probabilities wrong, as demonstrated by the following example.
One way to describe causality theory is that it aims to clarify the situations under which correlation does in fact equal causation (i.e., the conditional probabilities are equal to the causal probabilities), and how (by appropriately controlling for confounders) we can get to such a situation.
Example (two diseases) Consider the diagram below where there are two diseases and such that each occurs independently with probability . We assume each will send you to the hospital (variable ) and those are the only reasons to arrive at the hospital.
If you control for (i.e., look at only people who went to the hospital), we find that the probabilities are now correlated: A priori the probability is , and conditioned on , the probability is .
This relates to the joke “the probability of having 2 bombs on a plane is very low, so if I bring a bomb then it is very unlikely that there will be another bomb.”
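The two-diseases effect can be checked by brute-force enumeration. In the sketch below the disease probability `p = 1/10` and the variable names `x`, `y` (diseases) and `z` (hospital) are my own illustrative choices, not values from the lecture.

```python
from fractions import Fraction as Fr
from itertools import product

p = Fr(1, 10)   # hypothetical disease probability, chosen only for illustration

# Joint distribution over (x, y, z): independent diseases, z = 1 iff x or y.
joint = {}
for x, y in product((0, 1), repeat=2):
    px = p if x else 1 - p
    py = p if y else 1 - p
    joint[x, y, int(x or y)] = px * py

def pr(event, given=lambda x, y, z: True):
    den = sum(q for s, q in joint.items() if given(*s))
    num = sum(q for s, q in joint.items() if given(*s) and event(*s))
    return num / den

has_x = lambda x, y, z: x == 1
print("Pr[x]            =", pr(has_x))
print("Pr[x | y]        =", pr(has_x, lambda x, y, z: y == 1))
print("Pr[x | hosp]     =", pr(has_x, lambda x, y, z: z == 1))
print("Pr[x | hosp, y]  =", pr(has_x, lambda x, y, z: z == 1 and y == 1))
```

In the whole population, knowing about one disease says nothing about the other. Among hospital patients, learning that a patient has one disease makes the other much less likely, even though neither disease causes or prevents the other.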
In general, the causal graph can look as one of the following shapes:
If \(Z\) is a fork then controlling for \(Z\) can tease out the causal relation. If \(Z\) is a mediator or collider then controlling for \(Z\) can actually make things worse.
Backdoor paths: If \(A\) and \(B\) are two random variables, we say that there is a “backdoor path” from \(A\) to \(B\) if there is a direct ancestor \(C\) of \(A\) that is connected to \(B\) in the undirected version of the causal graph by a path not going through \(A\).
We can show the following theorem:
Theorem: If there is no backdoor path then \[\Pr[B=b \mid \mathrm{do}(A=a)] = \Pr[B=b \mid A=a].\]
Here is a “proof by picture”:
If there isn’t a backdoor path, we sort the graph in topological order, so that all the events that happen before \(A\) are not connected to \(B\) except through \(A\). So we can first generate all the variables that result in \(A\). Then the probability distribution of the events between \(A\) and \(B\) only depends on the value of \(A\), and so similarly \(B\) is generated from some probability distribution that only depends on \(A\).
When we design experiments, we often want to estimate causal effects, and to do so we try to make sure we eliminate backdoor paths.
Consider the example of a COVID vaccine trial.
We let \(V\) be the event that a trial participant obtained a vaccine, and \(C\) be the event that the participant was infected with COVID.
We want to figure out \(\Pr[C \mid \mathrm{do}(V)]\).
However, there is a “backdoor path”.
You will not get the vaccine if you don’t participate in the trial (which we denote by \(P\)), but participating in the trial could change your behavior and hence have a causal effect on \(C\).
To fix this we can cut the backdoor path using a placebo: it removes the confounding variable of participation, since it ensures that (conditioning on \(P\)) \(V\) is now independent of any behavioral changes that might impact \(C\).
In general, how does conditioning on some variable \(Z\) affect correlations? It may introduce correlations in events that occur before \(Z\), but cuts any path that depends on \(Z\).
Suppose we have some treatment variable \(T\) that we don’t get to control (e.g. in a natural experiment). Let \(Y_t = Y \mid \mathrm{do}(T=t)\), and we hope to estimate \(E[Y_1] - E[Y_0]\), which is known as the treatment effect.
However, we worry that some underlying variable \(Z\) (e.g. healthy lifestyle) can affect both \(T\) and \(Y\).
The propensity score, defined as \(e(z) = \Pr[T=1 \mid Z=z]\), allows us to calculate \(E[Y_1]\). We claim that as long as \(Z\) is a valid confounder (for which the formula (★) holds) \[E[Y_1] = E\left[\frac{Y\,T}{e(Z)}\right].\]
The proof is obtained by expanding out the claim, see below
Intuitively, knowing the probability that different groups of people get treatment allows us to make \(T\) independent from \(Z\) and calculate the treatment effect.
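To make the reweighting concrete, here is an exact toy computation. Everything in it is a made-up illustration. I take a binary confounder `z`, propensity scores of 1/4 and 3/4, and an outcome `y = t + 2z`, so that the true treatment effect is 1. I also assume the estimator in question is the standard inverse-propensity-weighted one, \(E[YT/e(Z)] - E[Y(1-T)/(1-e(Z))]\).

```python
from fractions import Fraction as Fr

p_z = {0: Fr(1, 2), 1: Fr(1, 2)}      # hypothetical distribution of the confounder
e = {0: Fr(1, 4), 1: Fr(3, 4)}        # hypothetical propensity scores Pr[T=1 | Z=z]

def outcome(t, z):
    return t + 2 * z                  # true treatment effect is 1

# Joint distribution over (z, t).
joint = {(z, t): p_z[z] * (e[z] if t else 1 - e[z]) for z in (0, 1) for t in (0, 1)}

def expect(f):
    return sum(p * f(z, t) for (z, t), p in joint.items())

# Naive comparison of treated and untreated groups: E[Y | T=1] - E[Y | T=0].
naive = (expect(lambda z, t: outcome(t, z) * t) / expect(lambda z, t: t)
         - expect(lambda z, t: outcome(t, z) * (1 - t)) / expect(lambda z, t: 1 - t))

# Inverse propensity weighting.
ipw = (expect(lambda z, t: outcome(t, z) * t / e[z])
       - expect(lambda z, t: outcome(t, z) * (1 - t) / (1 - e[z])))

print("naive:", naive, "ipw:", ipw)
```

In this toy model the naive comparison doubles the effect because treated people are disproportionately those with `z = 1`; dividing by the propensity score undoes that imbalance.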
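Calculating treatment effect using ML. Suppose that the treatment effect is \(\tau\) and \(Y = \psi(Z) + \tau T + \text{noise}\). Now, if we learn a model \(f(z) \approx E[Y \mid Z=z]\), then \[Y - f(Z) \approx \tau\,\bigl(T - e(Z)\bigr) + \text{noise},\] where \(e(z) = \Pr[T=1 \mid Z=z]\) is the propensity score.
Since both \(Y - f(Z)\) and \(T - e(Z)\) are calculable, we only need to do a linear regression.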
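When we cannot observe the confounding variable, we can still sometimes use instrumental variables to estimate a causal effect.
Assume a linear model \(Y = \tau T + f(W)\), where \(W\) is the stuff we don’t observe. If \(Z\) is some variable that satisfies \(\mathrm{Cov}(Z, f(W)) = 0\) and \(\mathrm{Cov}(Z, T) \neq 0\) then \[\tau = \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, T)},\]
which is the ratio between two observable quantities.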
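A tiny exact example may help here. I am assuming the ratio in question is the standard instrumental-variable estimate \(\mathrm{Cov}(Z,Y)/\mathrm{Cov}(Z,T)\). The model below (binary instrument `z`, unobserved binary `w`, treatment `t = z + w`, outcome `y = 3t + 5w`) is entirely hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

TAU = 3   # hypothetical true causal effect of t on y

# Enumerate the four equally likely (z, w) pairs; w is never used by the estimators.
samples = []
for z, w in product((0, 1), repeat=2):
    t = z + w                 # treatment depends on instrument and hidden variable
    y = TAU * t + 5 * w       # outcome depends on treatment and hidden variable
    samples.append((Fr(1, 4), z, t, y))

def cov(f, g):
    ef = sum(p * f(z, t, y) for p, z, t, y in samples)
    eg = sum(p * g(z, t, y) for p, z, t, y in samples)
    efg = sum(p * f(z, t, y) * g(z, t, y) for p, z, t, y in samples)
    return efg - ef * eg

Z = lambda z, t, y: z
T = lambda z, t, y: t
Y = lambda z, t, y: y

iv_estimate = cov(Z, Y) / cov(Z, T)   # uses only observable quantities
naive_slope = cov(T, Y) / cov(T, T)   # ordinary regression slope of y on t
print("IV:", iv_estimate, "naive regression:", naive_slope)
```

The naive slope is inflated by the hidden variable, while the instrument, being independent of it, recovers the assumed effect exactly in this toy model.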
We focus on fairness in classification problems, rather than fairness in learning generative models or representation (which also have their own issues, see in particular this paper by Bender, Gebru, McMillan-Major, and “Shmitchell”).
In the public image, AI has been perceived to be very successful for some tasks, and some people might hope that it is more “objective” or “impartial” than human decisions (which are known to be fraught with bias). However, there are some works suggesting this might not be the case:
Drug use in Oakland:
Drug arrests in Oakland:
While algorithms can sometimes also help, the populations they help might not be distributed equally. For example, see this table from Gates, Perry and Zorn. A more accurate underwriting model (that can better predict the default probability) enables a lender to use a more aggressive risk cutoff and so end up lending to more people.
However, this is true within each subpopulation too, so it may be that if the model is less accurate in a certain subpopulation, then a profit-maximizing lender will unfairly offer fewer loans to this subpopulation.
In the case of employment discrimination in the U.S., we have the following components:
Employers are not allowed to discriminate across protected classes when hiring. The unfairness metric gives us a way to measure if there is discrimination with respect to a protected class. In particular, disparate impact across different protected classes is often necessary but not sufficient evidence of discrimination.
To see why algorithms, which at first glance seem agnostic to group membership, may exhibit disparate treatment or impact, we consider the following Google visualization by Wattenberg, Viégas, and Hardt.
Consider a blue population and an orange population for which there is no difference in the probability of a member of either population paying back the loan, but for which our model has different accuracies; in particular, the model is more accurate on the orange population. This is described by the plot below, in which the scores correspond to the model’s prediction of the probability of paying back the loan, unfilled circles correspond to those who actually do not pay back the loan, and filled-in circles correspond to those who do.
Suppose we are in charge of making a lending decision given the model prediction.
A scenario in which we give everyone a loan would be fair, but would be bad for us: we would go bankrupt!
Profit when giving everyone a loan:
If we wanted to maximize profit, we would, however, give more loans to the orange population (since we’re more sure about which members of the orange population would actually pay back their loans) by setting a lower threshold (in terms of the score given by our algorithm) above which we give out loans.
This maximizes profit but is blatantly unfair. We are treating the identical blue and orange groups differently, just because our model is more accurate on one than the other, and we also have disparate impact on the two groups. A non-defaulting applicant would be 78% likely to get a loan if they are a member of the orange group, but only 60% likely to get a loan if they are a member of the blue group.
This “profit maximization” is likely the end result of any sufficiently complex lending algorithm in the absence of a fairness intervention. Even if the algorithm does not explicitly rely on the group membership attribute, by simply optimizing it to maximize profit, it may well pick up on attributes that are correlated with group membership.
Suppose on the other hand that we wanted to mandate “equal treatment” in the sense of keeping the same thresholds for the blue and orange group. The result would be the following:
In this case, since the thresholds are identical, the algorithm will be calibrated. 79% of the decisions we make will be the correct ones, for both the blue and orange population. So, from our point of view, the algorithm is fair and treats the blue and orange populations identically. However, from the point of view of the applicants, this is not the case. If you are a blue applicant that will pay your loan, you have an 81% chance of getting a loan, but if you are an orange customer you only have a 60% chance of getting it. This demonstrates that defining fairness is quite delicate. In particular the above “color blind” algorithm is still arguably unfair.
This difference between the point of view of the lender and lendee also arose in the recidivism case mentioned above. From the point of view of the defendant that would not recidivate, the algorithm was more likely to label them as “high risk” if they were Black than if they were white. From the point of view of the decision maker, the algorithm was calibrated, and if anything it was a bit more likely that a white defendant labeled high risk would not recidivate than a Black defendant. See (slightly rounded and simplified) data below
If we wanted to achieve demographic parity (both populations get same total number of loans) or equal opportunity (true positive rate same for both) then we can do so, but again using different thresholds for each group:
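To illustrate how per-group thresholds fall out of an equal-opportunity requirement, here is a small sketch. The scores and repayment labels are invented for this example and are not the data behind the plots; the function names are likewise hypothetical.

```python
def true_positive_rate(scores, repaid, threshold):
    """Fraction of applicants who would repay that are approved at this threshold."""
    approved_and_repaid = sum(1 for s, r in zip(scores, repaid) if r and s >= threshold)
    return approved_and_repaid / sum(repaid)

def threshold_for_tpr(scores, repaid, target):
    """Highest score threshold whose true positive rate is at least `target`."""
    for th in sorted(set(scores), reverse=True):
        if true_positive_rate(scores, repaid, th) >= target:
            return th
    return min(scores)

# Invented toy data: model scores and whether each applicant actually repaid.
blue_scores = [20, 35, 45, 50, 55, 60, 70, 80]
blue_repaid = [0, 0, 1, 0, 1, 1, 1, 1]
orange_scores = [10, 20, 30, 40, 60, 70, 80, 90]
orange_repaid = [0, 0, 0, 1, 1, 1, 1, 1]

target = 0.8   # require the same true positive rate in both groups
blue_th = threshold_for_tpr(blue_scores, blue_repaid, target)
orange_th = threshold_for_tpr(orange_scores, orange_repaid, target)
print("blue threshold:", blue_th, "orange threshold:", orange_th)
```

Even in this toy setting the two groups end up with different thresholds once the true positive rates are forced to match, which is the trade-off described above.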
While the above was a hypothetical scenario, a real life example was shown by Hardt, Price and Srebro using credit (also known as FICO) scores, as described by the plot below:
For a single threshold, around 75% of Asian candidates will get loans, whereas only around 20% of Black candidates will get loans. To ensure that all groups get loans at the same rate, we would need to set the thresholds differently. To equalize opportunity, we would likewise need to set different thresholds for different groups.
We see that we have different notions of what it means to be fair and that each of these different notions results in a different algorithm.
Berkeley graduate admissions in 1973 had the following statistics:
This paradox is commonly referred to as Simpson’s Paradox.
A “fair” causal model for this scenario might be as follows:
In the above, perhaps gender has a causal impact on the choice of department to which the applicant applies. However, a fair application process would, conditional on the department, be independent of gender of the applicant.
However, not all models that follow this causal structure are necessarily fair. In the case Griggs v. Duke Power Co., 1971, the court ruled that decision-making under the following causal model was unfair:
While the model appears to be fair, since the job offer is conditionally independent of race, given the diploma, the court ruled that the job did not actually require a high school diploma. Hence, using the diploma as a factor in hiring decisions was really just a proxy for race, resulting in essentially purposeful unfair discrimination based on race. This creation of proxies is referred to as redlining.
We cannot come up with universal fairness criteria. The notion of fairness itself is based on assumptions about:
Fairness depends on what we choose to measure and observe, in both inputs and outputs, and how we choose to act upon them. In particular, we have the following causal structure, wherein measured inputs, decision-making, and measured outcomes all play a role in affecting the real world and function together in a feedback cycle:
A more comprehensive illustration is given in this paper of Friedler, Scheidegger, and Venkatasubramanian:
I just heard that Benny Chor died this morning. Chor did very important work on computational biology and distributed algorithms, but I (and probably many of my readers) know him primarily for his work on cryptography, for his work on randomness extraction and for introducing the notion of private information retrieval.
I only met him once, at the event for Oded Goldreich’s 60th birthday. On the occasion, he gave a talk on the Chor-Goldreich paper, which introduced the problem of randomness extraction from independent sources, and which introduced min-entropy as the right parameter by which to quantify the randomness content of random sources. He did so using the original slides used for the FOCS 1985 talk.
I took a picture during the talk, which I posted online, and later he sent me an email asking for the original. Sadly, this was the totality of our correspondence. I heard that besides being a brilliant and generous researcher, he was a very playful, likeable and nice person. My thoughts are with his family and his friends.
This coming Fall semester the Simons Institute for the Theory of Computing in Berkeley will have in-person activities, including the really interesting program on the complexity of statistical inference, within which I will co-organize a workshop on cryptography, average-case complexity, and the complexity of statistical problems.
As was the case before the pandemic, all Simons Institute events will be streamed and available remotely. This includes a new series of Public Lectures called “Breakthroughs” that starts next week with a talk by Virginia Williams on matrix multiplication.
The other night Dana and I watched “The Internet’s Own Boy,” the 2014 documentary about the life and work of Aaron Swartz, which I’d somehow missed when it came out. Swartz, for anyone who doesn’t remember, was the child prodigy who helped create RSS and Reddit, who then became a campaigner for an open Internet, who was arrested for using a laptop in an MIT supply closet to download millions of journal articles and threatened with decades in prison, and who then committed suicide at age 26. I regret that I never knew Swartz, though he did once send me a fan email about Quantum Computing Since Democritus.
Say whatever you want about the tactical wisdom or the legality of Swartz’s actions; it seems inarguable to me that he was morally correct, that certain categories of information (e.g. legal opinions and taxpayer-funded scientific papers) need to be made freely available, and that sooner or later our civilization will catch up to Swartz and regard his position as completely obvious. The beautifully-made documentary filled me with rage and guilt not only that the world had failed Swartz, but that I personally had failed him.
At the time of Swartz’s arrest, prosecution, and suicide, I was an MIT CS professor who’d previously written in strong support of open access to scientific literature, and who had the platform of this blog. Had I understood what was going on with Swartz—had I taken the time to find out what was going on—I could have been in a good position to help organize a grassroots campaign to pressure the MIT administration to urge prosecutors to drop the case (like JSTOR had already done), which could plausibly have made a difference. As it was, I was preoccupied in those years with BosonSampling, getting married, etc., I didn’t bother to learn whether anything was being done or could be done about the Aaron Swartz matter, and then before I knew it, Swartz had joined Alan Turing in computer science’s pantheon of lost geniuses.
But maybe there was something deeper to my inaction. If I’d strongly defended the substance of what Swartz had done, it would’ve raised the question: why wasn’t I doing the same? Why was I merely complaining about paywalled journals from the comfort of my professor’s office, rather than putting my own freedom on the line like Swartz was? It was as though I had to put some psychological distance between myself and the situation, in order to justify my life choices to myself.
Even though I see the error in that way of “thinking,” it keeps recurring, keeps causing me to make choices that I feel guilt or at least regret about later. In February 2020, there were a few smart people saying that a new viral pneumonia from Wuhan was about to upend life on earth, but the people around me certainly weren’t acting that way, and I wasn’t acting that way either … and so, “for the sake of internal consistency,” I didn’t spend much time thinking about it or investigating it. After all, if the fears of a global pandemic had a good chance of being true, I should be dropping everything else and panicking, shouldn’t I? But I wasn’t dropping everything else and panicking … so how could the fears be true?
Then I publicly repented, and resolved not to make such an error again. And now, 15 months later, I realize that I have made such an error again.
All throughout the pandemic, I’d ask my friends, privately, why the hypothesis that the virus had accidentally leaked from the Wuhan Institute of Virology wasn’t being taken far more seriously, given what seemed like a shockingly strong prima facie case. But I didn’t discuss the lab leak scenario on this blog, except once in passing. I could say I didn’t discuss it because I’m not a virologist and I had nothing new to contribute. But I worry that I also didn’t discuss it because it seemed incompatible with my self-conception as a cautious scientist who’s skeptical of lurid coverups and conspiracies—and because I’d already spent my “weirdness capital” on other issues, and didn’t relish the prospect of being sneered at on social media yet again. Instead I simply waited for discussion of the lab leak hypothesis to become “safe” and “respectable,” as today it finally has, thanks to writers who were more courageous than I was. I became, basically, another sheep in one of the conformist herds that we rightly despise when we read about them in history.
(For all that, it’s still plausible to me that the virus had a natural origin after all. What’s become clear is simply that, even if so, the failure to take the possibility of a lab escape more seriously back when the trail of evidence was fresher will stand as a major intellectual scandal of our time.)
Sometimes people are wracked with guilt, but over completely different things than the world wants them to be wracked with guilt over. This was one of the great lessons that I learned from reading Richard Rhodes’s The Making of the Atomic Bomb. Many of the Manhattan Project physicists felt lifelong guilt, not that they’d participated in building the bomb, but only that they hadn’t finished the bomb by 1943, when it could have ended the war in Europe and the Holocaust.
On a much smaller scale, I suppose some readers would still like me to feel guilt about comment 171, or some of the other stuff I wrote about nerds, dating, and feminism … or if not that, then maybe about my defense of a two-state solution for Israel and Palestine, or of standardized tests and accelerated math programs, or maybe my vehement condemnation of Trump and his failed insurrection. Or any of the dozens of other times when I stood up and said something I actually believed, or when I recounted my experiences as accurately as I could. The truth is, though, I don’t.
Looking back—which, now that I’m 40, I confess is an increasingly large fraction of my time—the pattern seems consistent. I feel guilty, not for having stood up for what I strongly believed in, but for having failed to do so. This suggests that, if I want fewer regrets, then I should click “Publish” on more potentially controversial posts! I don’t know how to force myself to do that, but maybe this post itself is a step.
Faculty hiring in computer science is a process long due for an overhaul. The pandemic certainly changed some of the dynamics, moving most of the interviews online and saving a ton of money and time. Will this be the start of a fresh approach to recruiting?
A typical search in the past few years had some schools flying in 30-40 candidates, typically costing over $1000 each and amounting to a full-time job for a staff member during the search. We'd justify the expense as small compared to the millions we'd invest in a faculty member throughout their career, but it is generally the largest discretionary expense for a CS department. It also gives advantages to rich departments over others.
During the pandemic all those interviews moved online and worked reasonably well at virtually no additional cost. Also no need to scrounge around to find faculty willing to skip family meals to have dinner with the candidates. And if a faculty member had a conflict with a candidate on the interview day, they could schedule on a different day. There really is no reason to have all the meetings on the same day.
With the pandemic mostly behind us, will we go back to in-person interviews moving forward? I suspect the airport interview, where you fly out 20 or so candidates to have hour-long interviews in a hotel near an airport with a search committee for an administrative position, will be the first to go completely virtual.
Even for regular faculty interviews, there will be great pressure to reduce the number of in-person visits, perhaps to just the top candidates, or just the ones who have offers--make the "second visit" the only visit. Richer departments may find the expense worthwhile to make a bigger impression on the candidates and that will only expand the advantage of wealthier universities.
Times like this are the perfect opportunity for CS leadership to come in and give some sanity to the hiring process but I'm not holding my breath.
After 16 months without lecturing to an audience in my same location, I gave yesterday two lectures at the Technion in front of a live audience (and some additional audience in remote locations). The main lecture was in COMSOC 2021, an international conference on computational social choice, and earlier I gave a guest lecture in Roy Meshulam’s class about simple polytopes. I also met many friends.
Reshef Meir who organized (with Bill Zwicker) COMSOC 2021 wrote:
Hi all, today was beyond expectations – the first feeling of a real actual conference after almost a year and a half! We had about 40 people attending, viewing posters, and listening to talks. I truly hope this will return to be a common scene and that we can all meet face to face soon.
In my COMSOC lecture I talked about some earlier ideas and results in my work on social choice, starting with my paper with Ariel Rubinstein and Rani Spiegler on rationalizing individual choice by multiple rationals, and my subsequent attempt to use learnability as a tool for understanding choices of economic agents. This led to interesting questions on social choice that are discussed in this 2009 post.
In Roy’s course I explained h-vectors of polytopes and the Dehn-Sommerville relations based on counting outdegrees of the graph of the polytope when we direct its edges based on a generic abstract objective function. I moved on to present a proof of Blind-Mani’s theorem that the graph of the polytope determines the full combinatorics. This proof is probably the one proof I presented the most and it is given in this 2009 post.
In my COMSOC lecture I described how to fill the two question marks in the table above.
We hope you are all staying safe. With massive vaccination programs across the globe we hope you and your loved ones are getting back to what used to be normal. With that out of the way, let us circle back to Property Testing. This month was less sleepy as compared to the two preceding months and we saw six papers in total (two of them explore problems in quantum property testing). Without further ado, let us take a deeper dive.
GSF-locality is not sufficient for proximity-oblivious testing, by Isolde Adler, Noleen Kohler, Pan Peng (arXiv) The notion of proximity oblivious testers was made explicit in the seminal work of Goldreich and Ron in 2009 [GR09]. A proximity oblivious tester for a graph property is a constant query tester that rejects a graph with probability that monotonically increases with distance to the property. (Edit: Correction) A property is called proximity oblivious testable (or PO testable) if it has a one sided proximity oblivious tester. [GR09] showed that a property \(\Pi\) is PO testable in the bounded degree model if and only if it is a “local” property of some kind which satisfies a certain non propagation condition. [GR09] conjectured that all such “local” properties satisfy this non propagation condition. This paper refutes the above conjecture from [GR09].
Coming up next. More action on triangle freeness.
Testing Triangle Freeness in the General Model in Graphs with Arboricity \(O(\sqrt n)\), by Reut Levi (arXiv) PTReview readers are likely to be aware that triangle freeness has been a rich source of problems for developing new sublinear time algorithms. This paper considers the classic problem of testing triangle freeness in general graphs. In the dense case, algorithms with running time depending only on \(\varepsilon\) are known thanks to the work of Alon, Fischer, Krivelevich and Szegedy. In the bounded degree case, Goldreich and Ron gave testers with query complexity \(O(1/\varepsilon)\). This paper explores the problem in the general graph case and proves an upper bound of \(O(\Gamma/d_{avg} + \Gamma)\) where \(\Gamma\) is the arboricity of the graph. The author also shows that this upper bound is tight for graphs with arboricity at most \(O(\sqrt n)\). Curiously enough, the algorithm does not take the arboricity of the graph as an input and yet \(\Gamma\) (the arboricity) shows up in the upper and lower bounds.
Testing Dynamic Environments: Back to Basics, by Yonatan Nakar and Dana Ron (arXiv) Goldreich and Ron introduced the problem of testing “dynamic environments” in 2014. Here is the setup for this problem. You are given an environment that evolves according to a local rule. Your goal is to query some of the states in the system at some point of time and determine if the system is evolving according to some fixed rule or is far from it. In this paper, the authors consider environments defined by elementary cellular automata which evolve according to threshold rules as one of the first steps towards understanding what makes a dynamic environment efficiently testable. The main result proves the following: if your local rules satisfy some conditions, you can use a meta algorithm with query complexity \(poly(1/\varepsilon)\) which is non adaptive and has one sided error. And all the threshold rules indeed satisfy these conditions, which means they can be tested efficiently.
Identity testing under label mismatch, by Clement Canonne and Karl Wimmer (arXiv) This paper considers a classic problem in distribution testing with the following twist. Let \(q\) denote a distribution supported on \([n]\). You are given access to samples from another distribution \(p\) where \(p = q \circ \pi\) where \(\pi\) is some unknown permutation. Thus, I relabel the data and I give you access to samples from the relabeled dataset. Under this promise, note that identity testing becomes a trivial problem if \(q\) is known to be uniform over \([n]\). The authors develop algorithms for testing and tolerant testing of distributions under this additional promise of \(p\) being a permutation of some known distribution \(q\). The main result shows an exponential gap between the sample complexity of testing and tolerant testing under this promise. In particular, identity testing under the promise of permutation has sample complexity \(\Theta(\log^2 n)\) whereas tolerant identity testing under this promise has sample complexity \(\Theta(n^{1-o(1)})\).
Testing symmetry on quantum computers, by Margarite L. LaBorde and Mark M. Wilde (arXiv) This paper develops algorithms which test symmetries of quantum states and channels generated by quantum circuits. These tests also quantify how symmetric these states (or channels) are. For testing what are called “Bose states” the paper presents efficient algorithms. The tests for other kinds of symmetry presented in the paper rely on some aid from a quantum prover.
Quantum proofs of proximity, by Marcel Dall’Agnol, Tom Gur, Subhayan Roy Moulik, Justin Thaler (ECCC) The sublinear time (quantum) computation model has been gathering momentum steadily over the past several years. This paper seeks to understand the power of \({\sf QMA}\) proofs of proximity for property testing (recall \({\sf QMA}\) is the quantum analogue of \({\sf NP}\)). On the algorithmic front, the paper develops sufficient conditions for properties to admit efficient \({\sf QMA}\) proofs of proximity. On the complexity front, the paper demonstrates a property which admits an efficient \({\sf QMA}\) proof but does not admit an \({\sf MA}\) or an interactive proof of proximity.
I now have a feature article up at Quanta magazine, entitled “What Makes Quantum Computing So Hard To Explain?” I.e., why do journalists, investors, etc. so consistently get central points wrong, even after the subject has been in public consciousness for more than 25 years? Perhaps unsurprisingly, I found it hard to discuss that meta-level question, as Quanta’s editors asked me to do, without also engaging in the object-level task of actually explaining QC. For regular Shtetl-Optimized readers, there will be nothing new here, but I’m happy with how the piece turned out.
Accompanying the Quanta piece is a 10-minute YouTube explainer on quantum computing, which (besides snazzy graphics) features interviews with me, John Preskill, and Dorit Aharonov.
On a different note, my colleague Mark Wilde has recorded a punk-rock song about BosonSampling. I can honestly report that it’s some of the finest boson-themed music I’ve heard in years. It includes the following lyrics:
Quantum computer, Ain’t no loser
Quantum computer, Quantum computer
People out on the streets
They don’t know what it is
They think it finds the cliques
Or finds graph colorings
But it don’t solve anything
Said it don’t solve anything
Bosonic slot machine
My lil’ photonic dream
Speaking of BosonSampling, A. S. Popova and A. N. Rubtsov, of the Skolkovo Institute in Moscow, have a new preprint entitled Cracking the Quantum Advantage threshold for Gaussian Boson Sampling. In it, they claim to give an efficient classical algorithm to simulate noisy GBS experiments, like the one six months ago from USTC in China. I’m still unsure how well this scales from 30-40 photons up to 50-70 photons; which imperfections of the USTC experiment are primarily being taken advantage of (photon losses?); and how this relates to the earlier proposed classical algorithms for simulating noisy BosonSampling, like the one by Kalai and Kindler. Anyone with any insight is welcome to share!
OK, one last announcement: the Simons Institute for the Theory of Computing, in Berkeley, has a new online lecture series called “Breakthroughs,” which many readers of this blog might want to check out.
We're having an online workshop on "Machine Learning for Algorithms" on July 13-14, with a great group of speakers. Announcement below, link at https://fodsi.us/ml4a.html, free registration (but please register in advance)!
In our latest paper, in collaboration with Microsoft Research, we introduce 3DB: an extendable, unified framework for debugging and analyzing vision models using photorealistic simulation. We’re releasing 3DB as a package, accompanied by extensive API documentation, guides, and demos.
Identifying failure modes and biases in vision models is a rapidly emerging challenge in machine learning. In high-stakes applications, simply deploying models and collecting failures that arise in the wild is often difficult, expensive, and irresponsible. To this end, a recent line of work in vision focuses on identifying model failure modes via in-depth analyses of image transformations and corruptions, object orientations, backgrounds, or shape-texture conflicts. These studies (and other similarly important ones) reveal a variety of patterns of performance degradation in vision models. Still, performing each such study requires time, developing (often complex) tooling, and a willingness to settle for less than perfect simulations of each potential failure mode. (For our study of image backgrounds, for example, we used a combination of bounding boxes and classical computer vision tools to crop out image backgrounds. We then had to manually filter out the images for which the tools failed. Even for the images where the toolkit succeeded, there remained inevitable cropping artifacts.) Our question is: can we support reliable discovery of model failures in a systematic, automated, and unified way?
In our latest paper, we try to make progress on this question and propose 3DB, a platform for automatically identifying and analyzing the failure modes of computer vision models using 3D rendering. 3DB aims to allow users to go from a testable, robustness-based hypothesis to concrete, photorealistic experimental evidence with minimal time and effort.
The platform revolves around the modular workflow pictured below. First, users specify a set of 3D objects and environments, as well as a set of 3D (or 2D) transformations called controls that determine the space of admissible object-environment configurations. 3DB then renders a myriad of admissible scenes and feeds them through the user’s computer vision model of choice. The user can finally stratify, aggregate, or otherwise analyze the results either by reading the outputted JSON, or through the pre-packaged dashboard.
3DB easily adapts to a variety of use cases: in particular, users can modify and swap out any part of this pipeline (e.g., the renderer, the logger, the model type, or the controls) for their own custom-written components, without needing to modify any of the 3DB codebase. We’ve compiled guides, extensive API documentation, and a full demo showing how 3DB streamlines model debugging.
In fact, this blog post will double as another demo! We’ll present the (short) code necessary to reproduce every plot in the post below using 3DB. You can download the aggregated code for this blog post here.
To set up, follow the steps below—then, in the remainder of this post, press “Show/hide code and instructions” to see the steps necessary to reproduce each experiment below.
cd blog_demo, then bash setup.sh (assumes unzip is installed) to download a large Blender environment, then cd ../
curl -L https://git.io/Js8eT | bash /dev/stdin threedb
conda activate threedb
Set up a BLENDER_DATA folder that contains two subfolders: blender_models/, containing 3D models (.blend files with a single object whose name matches the filename), and blender_environments/, containing environments. We will provide you with these later.
Create base.yaml and paste in the configuration from the next pane.
inference:
  module: 'torchvision.models'
  label_map: 'blog_demo/resources/imagenet_mapping.json'
  class: 'resnet18'
  normalization:
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.classification'
  args:
    classmap_path: 'blog_demo/resources/ycb_to_IN.json'
    topk: 1
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
policy:
  module: "threedb.policies.random_search"
  samples: 5
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"
Prior works have already used 3D rendering (to great effect) to study biases of machine learning models, including pose and context-based biases. Our goal is not to propose a specific 3D-rendering based analysis, but rather to provide an easy-to-use, highly extendable framework that unifies prior analyses (both 3D and 2D) while enabling users to (a) conduct a host of new analyses with the same ease and with realistic results; and (b) effortlessly compose different factors of variation to understand their interplay.
We’ll dedicate the rest of this post to illustrating how one might actually use 3DB in practice, focusing on a single example 3D model (3DB works with any 3D model, and we refer the reader to our paper for more examples and details):
In what follows, we will walk through example applications of 3DB to discover biases of ML models (some previously documented, others not). For the sake of brevity, we’ll highlight just a few of these (re-)discoveries—to see more, check out the paper. We’ll then demonstrate that the discoveries of 3DB transfer pretty reliably to the real world!
Our experiments will all operate on an ImageNet-pretrained ResNet-18 (the classifier has ~70% validation-set accuracy) that has 42% accuracy on images from the “coffee mug” ImageNet subclass. While we only study classification in this blog post, 3DB also supports object detection and can be easily extended to support other image-based tasks, such as semantic segmentation, too.
In one of our previous posts, we continued a long line of prior work (see, e.g., here, here, here, etc.) showing that models can be over-reliant on image backgrounds, and demonstrated that they are easily broken by adversarially chosen backgrounds. To accomplish this, our prior analysis used classical computer vision tools to separate foregrounds from backgrounds, then pasted foregrounds from one image onto backgrounds from another. This process was slow and required extensive quality control to ensure that backgrounds and foregrounds were being extracted properly—and even when they were, a few artifacts remained:
3DB lets us reproduce these findings effortlessly and without introducing such artifacts. To demonstrate this, we use 3DB to render our mug 3D model on hundreds of HDRI backgrounds, resulting in images such as:
We then analyze the performance of a pretrained ResNet-18 on these images (recall that this model obtains 42% accuracy on the corresponding "coffee mug" ImageNet class subset). We find that the performance of the classifier varies significantly across backgrounds, and that accuracy correlates with a crude measure of “background simplicity” (the JPEG compressed size of the image, with a smaller size corresponding to a simpler background).
A note on compositionality: An important part of 3DB that we don’t discuss here is compositionality, i.e., the ability to put together multiple controls and study their joint effect. For example, in our paper we studied how a model’s predictions vary with various zoom levels and backgrounds of an object. We found that the optimal zoom level varies a lot by background.
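One way to quantify a relationship like the one between accuracy and background simplicity is a rank correlation. The sketch below is our own illustration, not code from 3DB or the paper: the helper names are hypothetical, and the numbers are placeholders rather than results from the post. It computes Spearman's rank correlation using only the standard library.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder values for illustration only (NOT measurements from the post):
# one entry per background, JPEG size in kilobytes and classifier accuracy.
jpeg_kilobytes = [12.0, 35.5, 48.2, 90.1]
accuracy = [0.9, 0.7, 0.6, 0.2]
rho = spearman(jpeg_kilobytes, accuracy)
print(rho)
```

A value of rho close to -1 would mean that larger (less simple) backgrounds go with lower accuracy; a rank-based measure is used here because the claim is only about a monotone trend, not a linear one.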
# $BLENDER_DATA/blender_environments contains several backgrounds and
# $BLENDER_DATA/blender_models contains the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/backgrounds
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo
# (Optional) Download additional backgrounds you want---e.g., from
# https://hdrihaven.com/hdris/ (both `.hdr` and `.blend` files work) and put
# them in BLENDER_DATA/blender_environments.
wget https://hdrihaven.com/hdris/PATH/TO/HDRI \
    -P $BLENDER_DATA/blender_environments
# Direct results
export RESULTS_FOLDER='results_backgrounds'
# Run 3DB (with the YAML file from the next pane saved as `backgrounds.yaml`):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA backgrounds.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
# Finally, run analysis using pandas (third pane)
python analyze_bgs.py
base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
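The configuration above points at base.yaml through base_config and then sets policy.samples to 20, where base.yaml had 5. We have not checked how 3DB itself resolves base_config, so the following is only a sketch of one plausible reading (a recursive overlay in which the experiment file wins on conflicts). The function name merge_configs and the trimmed-down dictionaries are our own assumptions.

```python
import copy


def merge_configs(base, override):
    """Overlay `override` on `base`: nested dicts merge key by key,
    any other value in `override` replaces the one in `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed-down stand-ins for base.yaml and the experiment file (hypothetical
# Python dicts; the real files would be parsed from YAML).
base = {
    "policy": {"module": "threedb.policies.random_search", "samples": 5},
    "render_args": {"resolution": 256, "samples": 16},
}
experiment = {
    "policy": {"samples": 20},
    "controls": [{"module": "threedb.controls.blender.denoiser"}],
}

merged = merge_configs(base, experiment)
print(merged["policy"])
```

Under this reading, the experiment file only needs to list what differs from the base configuration; everything else (model, evaluator, renderer, loggers) is inherited.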
import pandas as pd
import numpy as np
import json
log_lines = open('results_backgrounds/details.log').readlines()
class_map = json.load(open('results_backgrounds/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')
# .mode() can return several values on ties, so keep the first one
res = df.groupby('environment').agg(
    accuracy=('is_correct', 'mean'),
    most_frequent_prediction=('prediction', lambda x: x.mode().iloc[0]))
print(res)
Another recent study of neural network biases showed that in contrast to humans, convolutional neural networks (CNNs) rely more on texture to recognize objects than on shape. The example below typifies this phenomenon—a cat with an elephant texture is recognized as a cat by humans, but as an elephant by CNNs:
This example and others like it (dubbed ‘cue-conflict’ images) provide a striking illustration of the contrast between human and CNN-based classification mechanisms. Still, just as in the case of image backgrounds, creating such images typically necessitates time, technical skill, quality control, and/or introduction of unwanted artifacts (for example, in the above figure, ideally we would modify only the texture of the cat without altering the background).
However, using 3DB we can easily collect photorealistic empirical evidence of texture bias. Without modifying the internal 3DB codebase at all, one can write a custom control that modifies the texture of objects in the scene while keeping the rest intact. (In fact, the texture-swapping control for this experiment is now pre-packaged with 3DB, since we already wrote it ourselves!) With this custom control in place, one can simply randomize the texture of the mug across various backgrounds, poses and camera parameters before stratifying results:
The performance of the pretrained model on mugs (and other objects) deteriorates severely upon replacing the mug’s texture with a “wrong” one, providing clear corroborating evidence of the texture bias! We noticed in our experiments that for some textures (e.g., zebra), the coffee mug was consistently misclassified as the corresponding animal, whereas for others (e.g., crocodile), the mug is misclassified as either a related class (e.g., turtle or other reptile), or as an unrelated object (e.g., a trash can).
# ${BLENDER_DATA}/blender_environments contains several backgrounds,
# ${BLENDER_DATA}/blender_models contains the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/texture_swaps
# List the materials that we will use for this post:
ls blog_demo/data/texture_swaps/blender_control_material
# You can also make or download blender materials corresponding
# to other textures you want to test, and add them to that folder
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo
export RESULTS_FOLDER=results_texture
# Run 3DB (with the YAML file from the next pane saved as texture_swaps.yaml):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA texture_swaps.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
# Finally, run analysis using pandas (copy from third pane)
python analyze_ts.py
base_config: "base.yaml"
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.position"
    offset_x: 0.
    offset_y: 0.5
    offset_z: 0.
  - module: "threedb.controls.blender.pin_to_ground"
    z_ground: 0.25
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: [0., 1.]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.material"
    replacement_material: ["cow.blend", "elephant.blend", "zebra.blend", "crocodile.blend", "keep_original"]
  - module: "threedb.controls.blender.denoiser"
import pandas as pd
import numpy as np
import json
log_lines = open('results_texture/details.log').readlines()
class_map = json.load(open('results_texture/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df = df.drop('render_args', axis=1).join(pd.DataFrame(df.render_args.values.tolist()))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')
# .mode() can return several values on ties, so keep the first one
res = df.groupby('MaterialControl.replacement_material').agg(
    acc=('is_correct', 'mean'),
    most_frequent_prediction=('prediction', lambda x: x.mode().iloc[0]))
print(res)
Beyond general hypotheses about model biases, 3DB allows us to test vision systems on a more fine-grained level. In the case of our running mug example, for instance, we can use the platform to understand which specific parts of its 3D mesh correlate with classifier accuracy. Specifically, below we generate (and classify) scenes with random mug positions, rotations, and backgrounds. Since 3DB stores texture-coordinate information for each rendering, we can reconstruct a three-dimensional heatmap that encodes, for each point on the surface of the mug, the classifier’s accuracy conditioned on that point being visible:
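To make "accuracy conditioned on that point being visible" concrete, here is the quantity written out. The notation is ours, not the paper's: \(c_i \in \{0, 1\}\) records whether render \(i\) was classified correctly, and \(\mathbb{1}_i(u, v) \in \{0, 1\}\) records whether texture coordinate \((u, v)\) is visible in render \(i\).

```latex
\[
  \mathrm{acc}(u, v)
  \;=\;
  \frac{\sum_{i} c_i \, \mathbb{1}_i(u, v)}{\sum_{i} \mathbb{1}_i(u, v)}
\]
```

The analysis script for this experiment accumulates the numerator and denominator as two 256 by 256 arrays (uv_num_correct and uv_num_visible) and adds a small constant to the denominator so that never-visible coordinates do not cause a division by zero.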
A number of phenomena stand out from this heatmap, including:
# point BLENDER_DATA to the environments and models for this experiment
export BLENDER_DATA=$(realpath blog_demo)/data/part_of_object
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo
# Optionally: download additional backgrounds (`.hdr` or `.blend`) e.g.,
wget URL -O $BLENDER_DATA/blender_environments/new_env.hdr
# Point results folder to where you want output written
export RESULTS_FOLDER='results_part_of_object'
# Run 3DB (with the YAML file from the next pane):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA part_of_object.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
# Run the analysis script from the third pane (saved here as `po_analysis.py`)
# to generate the heat map of the mug.
python po_analysis.py
base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_uv: True
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: 1.
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
  - module: "threedb.controls.blender.background"
    H: 1.
    S: 0.
    V: 1.
import pandas as pd
import numpy as np
import json
from PIL import Image
DIR = 'results_part_of_object'
log_lines = open(f'{DIR}/details.log').readlines()
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
# From class index to class name (for readability)
class_map = json.load(open(f'{DIR}/class_maps.json'))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
# We'll be a little lenient here to get a more interesting heatmap
df['is_correct'] = df['prediction'].isin(['cup', 'coffee mug'])
uv_num_correct = np.zeros((256, 256))
uv_num_visible = np.zeros((256, 256))
for imid in df["id"].unique().tolist():
    is_correct = float(df.set_index('id').loc[imid]['is_correct'])
    vis_coords_im = Image.open(f'{DIR}/images/{imid}_uv.png')
    vis_coords = np.array(vis_coords_im).reshape(-1, 3)
    # R and G channels encode texture coordinates (x, y),
    # B channel is 255 for object and 0 for background
    # So we will filter by B then only look at R and G.
    vis_coords = vis_coords[vis_coords[:,2] > 0][:,:2]
    uv_num_visible[vis_coords[:,0], vis_coords[:,1]] += 1.
    uv_num_correct[vis_coords[:,0], vis_coords[:,1]] += is_correct
# Accuracy = # correct / # visible
uv_accuracy = uv_num_correct / (uv_num_visible + 1e-4)
# Builds a black-and-white heatmap image (call .save(...) or .show() on it to view)
Image.fromarray((255 * uv_accuracy).astype('uint8'))
Now that we have hypotheses regarding model performance, we can test them! Inspecting the ImageNet validation set, we found that our classifier indeed (a) struggles on coffee mugs when the handle is not showing (providing a feasible explanation for (1), since the side opposite the handle is only visible when the handle itself isn’t), and (b) performs worse at higher camera angles (providing a plausible explanation for (2)). We want to focus, however, on the third phenomenon, i.e., that the classifier performs quite poorly whenever the inside of the mug is visible. Why could this be the case? We can use 3DB to gain insight into the phenomenon. Specifically, we want to test the following hypothesis: when classifying mugs, does our ImageNet model rely on the exact liquid inside the cup?
We investigate this hypothesis by writing a custom control that fills our mug with various liquids (more precisely, a parameterized mixture of water, milk, and coffee):
In contrast to the last experiment (where we varied the orientation of the mug), we render scenes containing the mug in a fixed set of poses that reveal the contents—just as in the last experiment, however, we still vary background and mug location. We visualize the results below—each cell in the heatmap corresponds to a fixed mixture of coffee, water, and milk (i.e., the labeled corners are 100% coffee, 100% milk, and 100% water, and the other cells are linear interpolations of these ratios) and the color of the cell encodes the relative accuracy of the classifier when the mug is filled with that liquid:
It turns out that mug content indeed highly impacts classification: our model is much less likely to correctly classify a mug that doesn’t contain coffee! This is just one example of how 3DB can help in proving or disproving hypotheses about model behavior.
So far, we’ve used 3DB to discover ML models’ various failure modes and biases via photorealistic rendering. To what extent though do the insights gleaned from simulated 3DB experiments actually “transfer” to the physical world?
To test such transferability, we began by creating a 3D model of a physical room we had access to. We also collected eight different 3D models with closely matching physical world counterparts—including the mug analyzed above. Next, we used 3DB to find correctly and incorrectly classified configurations (pose, orientation, location) of these eight objects inside that room. Finally, we replicated these poses (to the best of our abilities) in the physical room, and took photos with a cellphone camera:
We classified these photos with the same vision model as before and measured how often the simulated classifier correctness matched correctness on the real photographs. We observed an ~85% match! So the failure modes identified by 3DB are not merely simulation artifacts, and can indeed arise in the real world.
3DB is a flexible, easy-to-use, and extensible framework for identifying model failure modes, uncovering biases, and testing fine-grained hypotheses about model behavior. We hope it will prove to be a useful tool for debugging vision models.
We’ll wrap up by highlighting some additional capabilities of 3DB that we didn’t get to demonstrate in this blog post:
In all of the code examples above, we showed how to analyze the results of a 3DB
experiment by loading the output into a pandas
dataframe. For additional
convenience, however, 3DB also comes with a web-based dashboard for exploring
experimental results. The following command suffices to visualize the texture swaps experiment from earlier:
python -m threedboard results_texture/ --port 3000
Navigating to YOUR_IP:3000
should lead you to a page that looks like this:
In this blog post, we focused on using 3DB to analyze image classification models. However, the library also supports object detection out-of-the-box, and can easily be extended to support a variety of image-based tasks (e.g., segmentation or regression-based tasks). For example, below we provide a simple end-to-end object detection example:
# The object detection example is separate from the rest of the blog demo, so
# run the following in a separate repo:
git clone https://github.com/3db/object_detection_demo
# The repo has a data/ folder containing the Blender model (a banana) and some
# HDRI backgrounds, a classmap.json file mapping the UID of the model to a COCO
# class, and the detection.yaml file from the next pane.
cd object_detection_demo/
export BLENDER_DATA=data/
export RESULTS_FOLDER=results/
# Run 3DB
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA detection.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
inference:
module: 'torchvision.models.detection'
class: 'retinanet_resnet50_fpn'
label_map: './resources/coco_mapping.json'
normalization:
mean: [0., 0., 0.]
std: [1., 1., 1.]
resolution: [224, 224]
args:
pretrained: True
evaluation:
module: 'threedb.evaluators.detection'
args:
iou_threshold: 0.5
nms_threshold: 0.1
max_num_boxes: 10
classmap_path: 'classmap.json'
render_args:
engine: 'threedb.rendering.render_blender'
resolution: 256
samples: 16
with_segmentation: true
policy:
module: "threedb.policies.random_search"
samples: 2
logging:
logger_modules:
- "threedb.result_logging.image_logger"
- "threedb.result_logging.json_logger"
controls:
- module: "threedb.controls.blender.orientation"
- module: "threedb.controls.blender.camera"
zoom_factor: [0.7, 1.3]
aperture: 8.
focal_length: 50.
- module: "threedb.controls.blender.denoiser"
In our latest paper, in collaboration with Microsoft Research, we introduce 3DB: an extendable, unified framework for debugging and analyzing vision models using photorealistic simulation. We’re releasing 3DB as a package, accompanied by extensive API documentation, guides, and demos.
Identifying failure modes and biases in vision models is a rapidly emerging challenge in machine learning. In high-stakes applications, simply deploying models and collecting failures that arise in the wild is often difficult, expensive, and irresponsible. To this end, a recent line of work in vision focuses on identifying model failure modes via in-depth analyses of image transformations and corruptions, object orientations, backgrounds, or shape-texture conflicts. These studies (and other similarly important ones) reveal a variety of patterns of performance degradation in vision models. Still, performing each such study requires time, developing (often complex) tooling, and a willingness to settle for less than perfect simulations of each potential failure mode. (For our study of image backgrounds, for example, we used a combination of bounding boxes and classical computer vision tools to crop out image backgrounds. We then had to manually filter out the images for which the tools failed. Even for the images where the toolkit succeeded, there remained inevitable cropping artifacts.) Our question is: can we support reliable discovery of model failures in a systematic, automated, and unified way?
In our latest paper, we try to make progress on this question and propose 3DB, a platform for automatically identifying and analyzing the failure modes of computer vision models using 3D rendering. 3DB aims to allow users to go from a testable, robustness-based hypothesis to concrete, photorealistic experimental evidence with minimal time and effort.
The platform revolves around the modular workflow pictured below. First, users specify a set of 3D objects and environments, as well as a set of 3D (or 2D) transformations called controls that determine the space of admissible object-environment configurations. 3DB then renders a myriad of admissible scenes and feeds them through the user’s computer vision model of choice. The user can finally stratify, aggregate, or otherwise analyze the results either by reading the outputted JSON, or through the pre-packaged dashboard.
3DB easily adapts to a variety of use cases: in particular, users can modify and swap out any part of this pipeline (e.g., the renderer, the logger, the model type, or the controls) for their own custom-written components, without needing to modify any of the 3DB codebase. We’ve compiled guides, extensive API documentation, and a full demo showing how 3DB streamlines model debugging.
In fact, this blog post will double as another demo! We’ll present the (short) code necessary to reproduce every plot in the post below using 3DB. You can download the aggregated code for this blog post here.
To set up, follow the steps below. Then, in the remainder of this post, press “Show/hide code and instructions” to see the steps necessary to reproduce each experiment.
1. Run cd blog_demo, then bash setup.sh (assumes unzip is installed) to download a large Blender environment, then cd ../
2. Run curl -L https://git.io/Js8eT | bash /dev/stdin threedb
3. Run conda activate threedb
4. Set up a BLENDER_DATA folder that contains two subfolders: blender_models/, containing 3D models (.blend files with a single object whose name matches the filename), and blender_environments/, containing environments. We will provide you with these later.
5. Create a file named base.yaml and paste in the configuration from the next pane.
inference:
  module: 'torchvision.models'
  label_map: 'blog_demo/resources/imagenet_mapping.json'
  class: 'resnet18'
  normalization:
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.classification'
  args:
    classmap_path: 'blog_demo/resources/ycb_to_IN.json'
    topk: 1
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
policy:
  module: "threedb.policies.random_search"
  samples: 5
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"
Prior works have already used 3D rendering (to great effect) to study biases of machine learning models, including pose and context-based biases. Our goal is not to propose a specific 3D-rendering based analysis, but rather to provide an easy-to-use, highly extendable framework that unifies prior analyses (both 3D and 2D) while enabling users to (a) conduct a host of new analyses with the same ease and with realistic results; and (b) effortlessly compose different factors of variation to understand their interplay.
We’ll dedicate the rest of this post to illustrating how one might actually use 3DB in practice, focusing on a single example 3D model (3DB works with any 3D model, and we refer the reader to our paper for more examples and details):
In what follows, we will walk through example applications of 3DB to discover biases of ML models (some previously documented, others not). For the sake of brevity, we’ll highlight just a few of these (re-)discoveries—to see more, check out the paper. We’ll then demonstrate that the discoveries of 3DB transfer pretty reliably to the real world!
Our experiments will all operate on an ImageNet-pretrained ResNet-18 (the classifier has a ~70% validation-set accuracy) that has 42% accuracy on images from the “coffee mug” ImageNet subclass. While we only study classification in this blog post, 3DB also supports object detection and can be easily extended to support other image-based tasks, such as semantic segmentation, too.
In one of our previous posts, we continued a long line of prior work (see, e.g., here, here, here, etc.) showing that models can be over-reliant on image backgrounds, and demonstrated that they are easily broken by adversarially chosen backgrounds. To accomplish this, our prior analysis used classical computer vision tools to separate foregrounds from backgrounds, then pasted foregrounds from one image onto backgrounds from another. This process was slow and required extensive quality control to ensure that backgrounds and foregrounds were being extracted properly—and even when they were, a few artifacts remained:
3DB lets us reproduce these findings effortlessly and without introducing such artifacts. To demonstrate this, we use 3DB to render our mug 3D model on hundreds of HDRI backgrounds, resulting in images such as:
We then analyze the performance of a pretrained ResNet-18 on these images (recall that this model obtains 42% accuracy on the corresponding “coffee mug” ImageNet class subset). We find that the performance of the classifier varies significantly across backgrounds, and that accuracy correlates with a crude measure of “background simplicity” (the JPEG-compressed size of the image, with a smaller size corresponding to a simpler background).
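One way to check a correlation like this is sketched below. This is a minimal illustration, not the analysis behind the post: it assumes you have already collected, for each background, the JPEG-compressed size of a rendered image (in bytes) and the classifier's accuracy on that background. The helper name `pearson` is hypothetical and the numbers are made-up placeholders, not results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT measurements from the post)
jpeg_bytes = [12_000, 25_000, 40_000, 61_000]   # one entry per background
accuracy = [0.9, 0.7, 0.6, 0.3]                 # accuracy on that background

r = pearson(jpeg_bytes, accuracy)
print(f"correlation between JPEG size and accuracy: {r:.2f}")
```

With data shaped like this, a negative coefficient would mean that accuracy drops as the background gets harder to compress, i.e., less "simple".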
A note on compositionality: An important part of 3DB that we don’t discuss here is compositionality, i.e., the ability to put together multiple controls and study their joint effect. For example, in our paper we studied how a model’s predictions vary with the zoom level and background of an object. We found that the optimal zoom level varies a lot by background.
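Studying a joint effect amounts to stratifying results by two factors at once. The sketch below shows the idea in plain Python under stated assumptions: the record fields (`environment`, `zoom`, `is_correct`) and the helper `joint_accuracy` are illustrative names, not the 3DB log schema, and the records are invented.

```python
from collections import defaultdict


def joint_accuracy(records, key_a, key_b):
    """Accuracy for every (key_a, key_b) combination seen in the records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cell = (rec[key_a], rec[key_b])
        total[cell] += 1
        correct[cell] += int(rec["is_correct"])
    return {cell: correct[cell] / total[cell] for cell in total}


# Hypothetical records; field names and values are for illustration only
records = [
    {"environment": "studio", "zoom": "near", "is_correct": True},
    {"environment": "studio", "zoom": "near", "is_correct": True},
    {"environment": "studio", "zoom": "far", "is_correct": False},
    {"environment": "forest", "zoom": "near", "is_correct": False},
    {"environment": "forest", "zoom": "far", "is_correct": True},
    {"environment": "forest", "zoom": "far", "is_correct": False},
]

table = joint_accuracy(records, "environment", "zoom")
for (env, zoom), acc in sorted(table.items()):
    print(f"{env:>7} {zoom:>5} {acc:.2f}")
```

Reading the table row by row shows whether the best zoom level changes from one background to the next, which a per-factor average would hide.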
# $BLENDER_DATA/blender_environments contains several backgrounds and
# $BLENDER_DATA/blender_models contains the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/backgrounds
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo
# (Optional) Download additional backgrounds you want---e.g., from
# https://hdrihaven.com/hdris/ (both `.hdr` and `.blend` files work) and put
# them in BLENDER_DATA/blender_environments.
wget https://hdrihaven.com/hdris/PATH/TO/HDRI \
    -P $BLENDER_DATA/blender_environments
# Direct results
export RESULTS_FOLDER='results_backgrounds'
# Run 3DB (with the YAML file from the next pane saved as `backgrounds.yaml`):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA backgrounds.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
# Finally, run analysis using pandas (third pane)
python analyze_bgs.py
base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
import pandas as pd
import numpy as np
import json
log_lines = open('results_backgrounds/details.log').readlines()
class_map = json.load(open('results_backgrounds/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')
res = df.groupby('environment').agg(
    accuracy=('is_correct', 'mean'),
    # mode() can return several values on ties; keep the first
    most_frequent_prediction=('prediction', lambda x: x.mode().iloc[0]))
print(res)
Another recent study of neural network biases showed that in contrast to humans, convolutional neural networks (CNNs) rely more on texture to recognize objects than on shape. The example below typifies this phenomenon—a cat with an elephant texture is recognized as a cat by humans, but as an elephant by CNNs:
This example and others like it (dubbed ‘cue-conflict’ images) provide a striking illustration of the contrast between human and CNN-based classification mechanisms. Still, just as in the case of image backgrounds, creating such images typically necessitates time, technical skill, quality control, and/or introduction of unwanted artifacts (for example, in the above figure, ideally we would modify only the texture of the cat without altering the background).
However, using 3DB we can easily collect photorealistic empirical evidence of texture bias. Without modifying the internal 3DB codebase at all, one can write a custom control that modifies the texture of objects in the scene while keeping the rest intact. With this custom control in place (in fact, the texture-swapping control for this experiment is now pre-packaged with 3DB, since we already wrote it ourselves), one can simply randomize the texture of the mug across various backgrounds, poses and camera parameters before stratifying results:
The performance of the pretrained model on mugs (and other objects) deteriorates severely upon replacing the mug’s texture with a “wrong” one, providing clear corroborating evidence of the texture bias! We noticed in our experiments that for some textures (e.g., zebra), the coffee mug was consistently misclassified as the corresponding animal, whereas for others (e.g., crocodile), the mug was misclassified as either a related class (e.g., turtle or another reptile) or an unrelated object (e.g., a trash can).
# ${BLENDER_DATA}/blender_environments contains several backgrounds,
# ${BLENDER_DATA}/blender_models contains the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/texture_swaps
# List the materials that we will use for this post:
ls blog_demo/data/texture_swaps/blender_control_material
# You can also make or download blender materials corresponding
# to other textures you want to test, and add them to that folder
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo
export RESULTS_FOLDER=results_texture
# Run 3DB (with the YAML file from the next pane saved as texture_swaps.yaml):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA texture_swaps.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
# Finally, run analysis using pandas (copy from third pane)
python analyze_ts.py
base_config: "base.yaml"
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.position"
    offset_x: 0.
    offset_y: 0.5
    offset_z: 0.
  - module: "threedb.controls.blender.pin_to_ground"
    z_ground: 0.25
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: [0., 1.]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.material"
    replacement_material: ["cow.blend", "elephant.blend", "zebra.blend", "crocodile.blend", "keep_original"]
  - module: "threedb.controls.blender.denoiser"
import pandas as pd
import numpy as np
import json
log_lines = open('results_texture/details.log').readlines()
class_map = json.load(open('results_texture/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df = df.drop('render_args', axis=1).join(pd.DataFrame(df.render_args.values.tolist()))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')
res = df.groupby('MaterialControl.replacement_material').agg(
    acc=('is_correct', 'mean'),
    # mode() can return several values on ties; keep the first
    most_frequent_prediction=('prediction', lambda x: x.mode().iloc[0]))
print(res)
Beyond general hypotheses about model biases, 3DB allows us to test vision systems on a more fine-grained level. In the case of our running mug example, for instance, we can use the platform to understand which specific parts of its 3D mesh correlate with classifier accuracy. Specifically, below we generate (and classify) scenes with random mug positions, rotations, and backgrounds. Since 3DB stores texture-coordinate information for each rendering, we can reconstruct a three-dimensional heatmap that encodes, for each point on the surface of the mug, the classifier’s accuracy conditioned on that point being visible:
A number of phenomena stand out from this heatmap, including:
# point BLENDER_DATA to the environments and models for this experiment
export BLENDER_DATA=$(realpath blog_demo)/data/part_of_object
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo
# Optionally: download additional backgrounds (`.hdr` or `.blend`) e.g.,
wget URL -O $BLENDER_DATA/blender_environments/new_env.hdr
# Point results folder to where you want output written
export RESULTS_FOLDER='results_part_of_object'
# Run 3DB (with the YAML file from the next pane):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA part_of_object.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
# Run the analysis script (third pane, saved as po_analysis.py) to generate the heat map of the mug.
python po_analysis.py
base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_uv: True
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: 1.
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
  - module: "threedb.controls.blender.background"
    H: 1.
    S: 0.
    V: 1.
import pandas as pd
import numpy as np
import json
from PIL import Image
DIR = 'results_part_of_object'
log_lines = open(f'{DIR}/details.log').readlines()
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
# From class index to class name (for readability)
class_map = json.load(open(f'{DIR}/class_maps.json'))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
# We'll be a little lenient here to get a more interesting heatmap
df['is_correct'] = df['prediction'].isin(['cup', 'coffee mug'])
uv_num_correct = np.zeros((256, 256))
uv_num_visible = np.zeros((256, 256))
for imid in df["id"].unique().tolist():
    is_correct = float(df.set_index('id').loc[imid]['is_correct'])
    vis_coords_im = Image.open(f'{DIR}/images/{imid}_uv.png')
    vis_coords = np.array(vis_coords_im).reshape(-1, 3)
    # R and G channels encode texture coordinates (x, y),
    # B channel is 255 for object and 0 for background
    # So we will filter by B then only look at R and G.
    vis_coords = vis_coords[vis_coords[:,2] > 0][:,:2]
    uv_num_visible[vis_coords[:,0], vis_coords[:,1]] += 1.
    uv_num_correct[vis_coords[:,0], vis_coords[:,1]] += is_correct
# Accuracy = # correct / # visible
uv_accuracy = uv_num_correct / (uv_num_visible + 1e-4)
# Build a black-and-white heatmap image; call its save() method to write it to disk
Image.fromarray((255 * uv_accuracy).astype('uint8'))
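To see what the accumulation in the script above computes, here is a toy version in plain Python on a 2x2 texture grid. It is a sketch for intuition only: the function name and the three "renders" are invented, and texels that are never visible are reported as None instead of using the small constant in the denominator. Each texel is counted once per render, which mirrors how NumPy's indexed `+=` applies a single increment to a repeated index.

```python
def visibility_conditioned_accuracy(renders, grid_size):
    """Per-texel accuracy, conditioned on the texel being visible.

    renders: list of (visible_texels, is_correct) pairs, where
    visible_texels is a list of (u, v) texture coordinates.
    """
    num_visible = [[0] * grid_size for _ in range(grid_size)]
    num_correct = [[0] * grid_size for _ in range(grid_size)]
    for visible_texels, is_correct in renders:
        # A set counts each texel once per render, however many pixels show it
        for u, v in set(visible_texels):
            num_visible[u][v] += 1
            num_correct[u][v] += int(is_correct)
    return [
        [num_correct[u][v] / num_visible[u][v] if num_visible[u][v] else None
         for v in range(grid_size)]
        for u in range(grid_size)
    ]


# Three invented renders: (visible texels, was the prediction correct?)
renders = [
    ([(0, 0), (0, 1), (0, 1)], True),
    ([(0, 1), (1, 1)], False),
    ([(0, 0)], True),
]
acc = visibility_conditioned_accuracy(renders, grid_size=2)
print(acc)
```

Texel (0, 1) is visible in one correct and one incorrect render, so its conditional accuracy is 0.5, while texel (1, 0) is never seen and has no estimate.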
Now that we have hypotheses regarding model performance, we can test them! Inspecting the ImageNet validation set, we found that our classifier indeed (a) struggles on coffee mugs when the handle is not showing (providing a feasible explanation for (1), since the side opposite the handle is only visible when the handle itself isn’t), and (b) performs worse at higher camera angles (providing a plausible explanation for (2)). We want to focus, however, on the third phenomenon, i.e., that the classifier performs quite poorly whenever the inside of the mug is visible. Why could this be the case? We can use 3DB to gain insight into the phenomenon. Specifically, we want to test the following hypothesis: when classifying mugs, does our ImageNet model rely on the exact liquid inside the cup?
We investigate this hypothesis by writing a custom control that fills our mug with various liquids (more precisely, a parameterized mixture of water, milk, and coffee):
In contrast to the last experiment (where we varied the orientation of the mug), we render scenes containing the mug in a fixed set of poses that reveal the contents—just as in the last experiment, however, we still vary background and mug location. We visualize the results below—each cell in the heatmap corresponds to a fixed mixture of coffee, water, and milk (i.e., the labeled corners are 100% coffee, 100% milk, and 100% water, and the other cells are linear interpolations of these ratios) and the color of the cell encodes the relative accuracy of the classifier when the mug is filled with that liquid:
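The grid of mixtures described above can be enumerated as points on a triangle whose corners are the pure liquids. The following is a sketch under assumptions: the post does not state the grid resolution or how the custom control is parameterized, so `mixture_grid` and the `steps` value are hypothetical.

```python
from fractions import Fraction


def mixture_grid(steps):
    """All (coffee, milk, water) fractions on a triangular grid that sum to 1."""
    grid = []
    for coffee in range(steps + 1):
        for milk in range(steps + 1 - coffee):
            water = steps - coffee - milk
            grid.append((Fraction(coffee, steps),
                         Fraction(milk, steps),
                         Fraction(water, steps)))
    return grid


cells = mixture_grid(4)  # assumed resolution: steps of 25%
print(len(cells), "mixtures")
for coffee, milk, water in cells[:3]:
    print(float(coffee), float(milk), float(water))
```

The three corners (1, 0, 0), (0, 1, 0) and (0, 0, 1) correspond to 100% coffee, 100% milk and 100% water; every other cell is a linear interpolation between them.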
It turns out that mug content indeed highly impacts classification: our model is much less likely to correctly classify a mug that doesn’t contain coffee! This is just one example of how 3DB can help in proving or disproving hypotheses about model behavior.
So far, we’ve used 3DB to discover ML models’ various failure modes and biases via photorealistic rendering. To what extent though do the insights gleaned from simulated 3DB experiments actually “transfer” to the physical world?
To test such transferability, we began by creating a 3D model of a physical room we had access to. We also collected eight different 3D models with closely matching physical world counterparts—including the mug analyzed above. Next, we used 3DB to find correctly and incorrectly classified configurations (pose, orientation, location) of these eight objects inside that room. Finally, we replicated these poses (to the best of our abilities) in the physical room, and took photos with a cellphone camera:
We classified these photos with the same vision model as before and measured how often the simulated classifier correctness matched correctness on the real photographs. We observed an ~85% match! So the failure modes identified by 3DB are not merely simulation artifacts, and can indeed arise in the real world.
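The match rate described above is the fraction of configurations on which the classifier is either right in both settings or wrong in both. A minimal sketch follows; `agreement_rate` is a hypothetical helper and the two lists are placeholders, not the data from the physical experiment.

```python
def agreement_rate(simulated, real):
    """Fraction of configurations where simulated and real correctness agree."""
    if len(simulated) != len(real) or not simulated:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(s == r for s, r in zip(simulated, real))
    return matches / len(simulated)


# Placeholder correctness flags, one per replicated configuration
sim_correct = [True, True, False, False, True, False, True, True]
real_correct = [True, False, False, False, True, False, True, True]

rate = agreement_rate(sim_correct, real_correct)
print(f"agreement: {rate:.1%}")
```

Note that this counts agreement on failures as well as on successes, which is what matters when asking whether simulated failure modes carry over to real photos.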
3DB is a flexible, easy-to-use, and extensible framework for identifying model failure modes, uncovering biases, and testing fine-grained hypotheses about model behavior. We hope it will prove to be a useful tool for debugging vision models.
We’ll wrap up by highlighting some additional capabilities of 3DB that we didn’t get to demonstrate in this blog post:
In all of the code examples above, we showed how to analyze the results of a 3DB experiment by loading the output into a pandas dataframe. For additional convenience, however, 3DB also comes with a web-based dashboard for exploring experimental results. The following command suffices to visualize the texture swaps experiment from earlier:
python -m threedboard results_texture/ --port 3000
Navigating to YOUR_IP:3000 should lead you to a page that looks like this:
In this blog post, we focused on using 3DB to analyze image classification models. However, the library also supports object detection out-of-the-box, and can easily be extended to support a variety of image-based tasks (e.g., segmentation or regression-based tasks). For example, below we provide a simple end-to-end object detection example:
# The object detection example is separate from the rest of the blog demo, so
# run the following in a separate repo:
git clone https://github.com/3db/object_detection_demo
# The repo has a data/ folder containing the Blender model (a banana) and some
# HDRI backgrounds, a classmap.json file mapping the UID of the model to a COCO
# class, and the detection.yaml file from the next pane.
cd object_detection_demo/
export BLENDER_DATA=data/
export RESULTS_FOLDER=results/
# Run 3DB
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA detection.yaml $RESULTS_FOLDER 5555
# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!
inference:
  module: 'torchvision.models.detection'
  class: 'retinanet_resnet50_fpn'
  label_map: './resources/coco_mapping.json'
  normalization:
    mean: [0., 0., 0.]
    std: [1., 1., 1.]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.detection'
  args:
    iou_threshold: 0.5
    nms_threshold: 0.1
    max_num_boxes: 10
    classmap_path: 'classmap.json'
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_segmentation: true
policy:
  module: "threedb.policies.random_search"
  samples: 2
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
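The configuration above sets iou_threshold to 0.5. Assuming this setting has its usual meaning in object detection (a predicted box counts as a match when its intersection-over-union with the target box reaches the threshold), the quantity being thresholded can be sketched as follows. This is a generic illustration, not the code of the 3DB detection evaluator; the (x1, y1, x2, y2) box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
target = (1, 1, 3, 3)
score = iou(predicted, target)
print(f"IoU = {score:.3f}, match at 0.5: {score >= 0.5}")
```

Here the two boxes overlap in a unit square out of a combined area of 7, so the prediction would not count as a match at a threshold of 0.5.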
The Simons Institute has started a new virtual seminar series highlighting recent advances in theoretical computer science. The first two talks in the series will be:
ICML 2021, one of the biggest conferences in machine learning, naturally has a ton of interesting-sounding papers on differential privacy. We went through this year’s accepted papers and aggregated all the relevant papers we could find. In addition, this year features three workshops on the topic of privacy, as well as a tutorial. As always, please inform us if we overlooked any papers on differential privacy.
Federated Learning for User Privacy and Data Confidentiality
Machine Learning for Data: Automated Creation, Privacy, Bias
A Framework for Private Matrix Analysis in Sliding Window Model
Jalaj Upadhyay, Sarvagya Upadhyay
Accuracy, Interpretability, and Differential Privacy via Explainable Boosting
Harsha Nori, Rich Caruana, Zhiqi Bu, Judy Hanwen Shen, Janardhan Kulkarni
Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha
Differentially Private Bayesian Inference for Generalized Linear Models
Tejas Kulkarni, Joonas Jälkö, Antti Koskela, Samuel Kaski, Antti Honkela
Differentially-Private Clustering of Easy Instances
Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Eliad Tsfadia
Differentially Private Correlation Clustering
Mark Bun, Marek Elias, Janardhan Kulkarni
Differentially Private Densest Subgraph Detection
Dung Nguyen, Anil Vullikanti
Differentially Private Quantiles
Jennifer Gillenwater, Matthew Joseph, Alex Kulesza
Differentially Private Query Release Through Adaptive Projection
Sergul Aydore, William Brown, Michael Kearns, Krishnaram Kenthapadi, Luca Melis, Aaron Roth, Ankit Siva
Differentially Private Sliced Wasserstein Distance
Alain Rakotomamonjy, Liva Ralaivola
Large Scale Private Learning via Low-rank Reparametrization
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
Leveraging Public Data for Practical Private Query Release
Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, Steven Wu
Locally Private k-Means in One Round
Alisa Chang, Badih Ghazi, Ravi Kumar, Pasin Manurangsi
Lossless Compression of Efficient Private Local Randomizers
Vitaly Feldman, Kunal Talwar
Oneshot Differentially Private Top-k Selection
Gang Qiao, Weijie Su, Li Zhang
PAPRIKA: Private Online False Discovery Rate Control
Wanrong Zhang, Gautam Kamath, Rachel Cummings
Practical and Private (Deep) Learning without Sampling or Shuffling
Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, Zheng Xu
Private Adaptive Gradient Methods for Convex Optimization
Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, Kunal Talwar
Private Alternating Least Squares: (Nearly) Optimal Privacy/Utility Trade-off for Matrix Completion
Steve Chien, Prateek Jain, Walid Krichene, Steffen Rendle, Shuang Song, Abhradeep Thakurta, Li Zhang
Private Stochastic Convex Optimization: Optimal Rates in L1 Geometry
Hilal Asi, Vitaly Feldman, Tomer Koren, Kunal Talwar
The Distributed Discrete Gaussian Mechanism for Federated Learning with Secure Aggregation
Peter Kairouz, Ziyu Liu, Thomas Steinke
I have recently posted the paper [Vio21] (download) which does something that I have been trying to do for a long time, more than ten years, on and off. Consider the basic data-structure problem of storing bits of data into bits so that the prefix-sum queries
can be computed by probing cells (or words) of bits each. (You can think throughout this post.) The paper [PV10] with Pǎtraşcu shows that , and this was recently shown to be tight by Yu [Yu19] (building on the breakthrough data structure [Pǎt08] which motivated the lower bound and is not far from it).
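For readers less familiar with the problem, the following sketch (an illustration added here, not taken from the cited papers) shows the two trivial extremes for prefix-sum queries. Storing only the raw bits uses no extra space but may read many bits per query, while storing every prefix sum answers a query with a single lookup at the cost of far more space. The lower bounds discussed here concern what is possible between these extremes. The function names are hypothetical.

```python
def rank_scan(bits, i):
    """Sum of the first i bits, computed by reading the raw bits."""
    return sum(bits[:i])


def build_prefix_table(bits):
    """Precompute every prefix sum: one lookup per query, but much more space."""
    table = [0]
    for b in bits:
        table.append(table[-1] + b)
    return table


bits = [1, 0, 1, 1, 0, 1]
table = build_prefix_table(bits)

# Both strategies answer every query identically
for i in range(len(bits) + 1):
    assert table[i] == rank_scan(bits, i)
print(table)
```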
As is common in data-structure lower bounds, the proof in [PV10] is an encoding argument. In the recently posted paper, an alternative proof is presented which avoids the encoding argument and is perhaps more in line with other proofs in complexity lower bounds. Of course, everything is an encoding argument, and nothing is an encoding argument, and this post won’t draw a line.
The new proof establishes an intrinsic property of efficient data structures, whereas typical proofs including [PV10] are somewhat tailored to the problem at hand. The property is called the separator and is a main technical contribution of the work. At the high level the separator shows that in any efficient data structure you can restrict the input space a little so that many queries are nearly pairwise independent.
Also, the new proof rules out a stronger object: a sampler (see previous post here on sampling lower bounds). Specifically, the distribution Rank where is the uniform distribution cannot be sampled, not even slightly close, by an efficient cell-probe algorithm. This implies the data-structure result, and it can be informally interpreted as saying that the “reason” why the lower bound holds is not that the data is compressed, but rather that one can’t generate the type of dependencies occurring in Rank via an efficient cell-probe algorithm, regardless of what the input is.
Building on this machinery, one can prove several results about sampling, like showing that cell-probe samplers are strictly weaker than AC0 samplers. While doing this, it occurred to me that one gets a corollary for data structures which I had not seen in the literature. The corollary is a probe hierarchy, showing that some problem can be solved with zero redundancy () with probes, while it requires almost linear for probes. For example I don’t know of a result yielding this for small such as ; I would appreciate a reference. (As mentioned in the paper, the sampling viewpoint is not essential and just like for Rank one can prove the data-structure corollaries directly. Personally, and obviously, I find the sampling viewpoint useful.)
One of my favorite open problems in the area still is: can a uniform distribution over be approximately sampled by an efficient cell-probe algorithm? I can’t even rule out samplers making two probes!
[Pǎt08] Mihai Pǎtraşcu. Succincter. In 49th IEEE Symp. on Foundations of Computer Science (FOCS). IEEE, 2008.
[PV10] Mihai Pǎtraşcu and Emanuele Viola. Cell-probe lower bounds for succinct partial sums. In 21st ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 117–122, 2010.
[Vio21] Emanuele Viola. Lower bounds for samplers and data structures via the cell-probe separator. Available at http://www.ccs.neu.edu/home/viola/, 2021.
You know, I loved math. My mom was a math teacher—Joan Cusack
Mary Kay Farley, my dear wife’s mom, and Dorothy Lipton, my mom, have unfortunately both passed away. Kathryn and I miss them greatly. Both women shared keen mathematical skills, a fascination with the game of baseball and a commitment to living a well-ordered life.
Today is not Mother’s Day. We still hope all mothers everywhere are enjoying their day.
We will take this time to thank all of you out there. We missed doing so last month, but the pandemic has distended time anyway. What we are hearing now are stories of mothers and children and grandchildren finally being able to think of seeing each other in person rather than via video.
This has blended with musings on our recent post in which I (Dick) noted that Dorit Aharonov is an academic grandchild of mine, in that Avi Wigderson co-supervised her doctoral thesis and I supervised Avi’s.
Years ago we featured on Father’s Day a post with the title “Who’s Your Doktorvater?”—which was a play on the expression “who’s your daddy?” Now it is high time to note that there are many “doctor mothers”—as Dorit has herself become.
One difference from human genealogy is that most often there is only one “doctor parent.” My advisor, David Parnas, has two: Alan Perlis and Everard Williams. From Perlis it is a straight shot back to Siméon Poisson, whose 1800 dissertation was co-advised by Joseph Lagrange and Pierre Laplace. For Lagrange there is a strange note of Leonhard Euler as a virtual advisor, but the real one is Giovanni Beccaria—who has no listed parent. Going through Laplace also dead-ends. But selecting Euler includes a chain that ends in the 1100s with Sharaf al-Dīn al-Ṭūsī, who improved the complexity of approximately solving cubic equations.
I appear not to have any female ancestors in my doctoral genealogy. I have two female PhD graduates, one of whom is a Doktormutter. Ken’s first female doctoral student, co-advised, had a successful thesis defense last week; he has another nearing the ABD stage. But I have known quite a few other “doctor mothers” personally. Today, Ken and I thought to recognize them.
Here are some that I have had the honor to know. They are in a certain order—do you see what it is? I give only the surname on purpose—click the second name and note its URL for a singular reflection of this.
The last gives us an all-female tree, not just one branch, of people we know. Besides Anna Gilbert, another of Ingrid Daubechies’s students, who herself has advisees, is Cynthia Rudin of Duke, whom Ken knew and taught while she was an undergraduate at Buffalo.
There are others I could mention who went into research labs where there are different relationships besides PhD advising. They include Irene Greif, Tal Rabin, Lynn Conway, and Jean Sammet. I could include Jamie Morgenstern, whom we recently featured and who is advising her first students at the University of Washington. Do they have to be “born” yet to count you as a Doktormutter? I’ve left others out (apologies for that), but the ones I’ve listed make a nice collage:
Of these, the one with the most formative impact on me was Helena Rasiowa. I learned advanced logic from her when I was an undergraduate.
Here is a tribute to her by Melvin Fitting:
I once heard Dana Scott criticize her book, The Mathematics of Metamathematics with Roman Sikorski, because, while it took an algebraic approach to logic, it did not carry the work further and consider set theory. If it had, then forcing would have been discovered years earlier than it was. This is not, at heart, a criticism, but a tribute. The building of mathematics always goes on. Foundations, firmly laid, enable later construction, and the foundations laid by that book were powerfully firm.
Ken also points to logic as a formative influence—though from men at Oxford. Both of us were attracted to Gödel-type undecidability issues in complexity theory in the early 1980s. The nexus of logic and algebra has been important to us in different ways, Ken more with finite automata and descriptive complexity. Courses in logic gave both of us a habit of framing problems along formal lines.
Which “doctor mothers” have you known or been influenced by?