# Theory of Computing Blog Aggregator

### Debuggable Deep Networks: Sparse Linear Models (Part 1)

This two-part series overviews our recent work on constructing deep networks that perform well while, at the same time, being easier to debug. Part 1 (below) describes our toolkit for building such networks and how it can be leveraged in the context of typical language and vision tasks. This toolkit applies the classical primitive of sparse linear classification on top of feature representations derived from deep networks, and includes a custom solver for fitting such sparse linear models at scale. Part 2 outlines a suite of human-in-the-loop experiments that we designed to evaluate the debuggability of such networks. These evaluations demonstrate, in particular, that simply inspecting the sparse final decision layer of these networks can facilitate detection of unintended model behaviors—e.g., spurious correlations and input patterns that cause misclassifications.

As ML models are being increasingly deployed in the real world, a question that jumps to the forefront is: how do we know these models are doing “the right thing”? In particular, how can we be sure that models aren’t relying on brittle or undesirable correlations extracted from the data, which undermines their robustness and reliability?

It turns out that, as things stand today, we often can’t. In fact, numerous recent studies have pointed out that seemingly accurate ML models base their predictions on data patterns that are unintuitive or unexpected, leading to a variety of downstream failures. For instance, in a previous post we discussed how adversarial examples arise because models make decisions based on imperceptible features in the data. There are many other examples of this—e.g., image pathology detection models relying on pen marks made by radiologists; and toxic comment classification systems being disproportionately sensitive to identity-group related keywords.

These examples highlight a growing need for model debugging tools: techniques which can facilitate the semi-automatic discovery of such failure modes. In fact, the closely related problem of interpretability, i.e., the task of precisely characterizing how and why models make their decisions, is already a major focus of the ML community.

## How to debug your deep network?

A natural approach to model debugging is to inspect the model directly. While this may be feasible in certain settings (e.g., for small linear classifiers or decision trees), it quickly becomes infeasible as we move towards large, complex models such as deep networks. To work around such scale issues, current approaches (spearheaded in the context of interpretability) attempt to understand model behavior in a somewhat localized or decomposed manner. In particular, there exist two prominent families of deep network interpretability methods—one that attempts to explain what individual neurons do [Yosinski et al. 2015, Bau et al. 2018] and the other one aiming to discern how the model makes decisions for specific inputs [Simonyan et al. 2013, Ribeiro et al. 2016]. The challenge however is that, as shown in recent studies [Adebayo et al., 2018, Adebayo et al., 2020, Leavitt & Morcos, 2020], such localized interpretations can be hard to aggregate, are easily fooled, and overall, may not give a clear picture of the model’s reasoning process.

Our work thus takes an alternative approach. First, instead of trying to directly obtain a complete characterization of how and why a deep network makes its decision (which is the goal in interpretability research), we focus on the more actionable problem of debugging unintended model behaviors. Second, instead of attempting to grapple with the challenge of analyzing these networks in a purely “post hoc” manner, we train them to make them inherently more debuggable.

The specific way we accomplish this goal is motivated by a natural view of a deep network as a composition of a feature extractor and a linear decision layer (see the figure below). From this viewpoint, we can break down the problem of inspecting and understanding a deep network into two subproblems: (1) interpreting the deep features (often called neurons in the literature; we will refer to them as features henceforth) and (2) understanding how these features are aggregated in the (final) linear decision layer to make predictions.

Overview of our approach to construct deep networks that are more debuggable: We train a sparse decision layer on (pre-trained) deep feature embeddings and then view the network’s decision process as a linear combination of these features.

Let us now discuss both of these subproblems in more detail.

### Task 1: Interpreting (deep) features

Given the architectural complexity of deep networks, precisely characterizing the role of even a single neuron (in any layer) is challenging. However, research in ML interpretability has brought us a number of heuristics geared towards identifying the input patterns that cause specific neurons (or features) to activate. Thus, for the first task, we leverage some of these existing feature interpretation techniques—specifically, feature visualization, in the case of vision models [Nguyen et al. 2019], and LIME, in the case of vision/language models [Ribeiro et al. 2016]. While these methods have certain limitations, they turn out to be surprisingly effective for model debugging within our framework. Also, note that our approach is fairly modular: we can substitute these methods with other, better variants as they emerge.

Although LIME was originally used to interpret the predicted outputs of a network, in our work we adapt it to interpret individual neurons instead (see our paper for more details).

Examples of feature visualizations for ImageNet classifiers: Feature visualizations for standard vision models (top) are often hard to parse despite significant research on this front. This may be a side effect of these models relying on human-unintelligible features to make their predictions (discussed in a previous post). On the other hand, robust vision models (bottom) tend to have more human-aligned features [Engstrom et al. 2019].

Feature interpretation for language models: Examples of a word cloud visualization for the positive and negative activation of a single neuron for a text sentiment classifier. We generate these by aggregating LIME explanations for features, with the whole process described in our paper.

### Task 2: Examining the decision layer

At first glance, the task of making sense of the decision layer of a deep network appears trivial. Indeed, this layer is linear and interpreting a linear model is a routine task in statistical analysis. However, this intuition is deceptive—the decision layers of modern deep networks often contain upwards of thousands of (deep) features and millions of parameters—making human inspection intractable.

Scale of typical decision layers: Feature visualizations for one quarter (512 out of 2048) of all the features of a robust ImageNet classifier. A typical dense decision layer will rely on a weighted sum of all of these features to produce a single prediction.

Recall that the major roadblock here is the size of the decision layer. What if we just constrained ourselves only to the “important” weights/features within this layer though? Would that allow us to understand the model?

To test this, we focus our attention on the features that are assigned large weights (in terms of magnitude) by the decision layer. (Note that all the features are standardized to have zero mean and unit variance to make such a weight comparison more meaningful.)
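Concretely, this selection can be sketched as follows, with a synthetic weight matrix and feature matrix standing in for a real model's decision layer and embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2048))   # deep feature embeddings: (examples, features)
W = rng.normal(size=(10, 2048))     # decision layer weights: (classes, features)

# Standardize each feature so that weight magnitudes are comparable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# "Important" features for a class: the k largest weights by magnitude.
k, target_class = 10, 3
top_k = np.argsort(np.abs(W[target_class]))[::-1][:k]
print(top_k)
```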

In the figure below, we evaluate the performance of the decision layer when it is restricted to using: (a) only the “important features” or (b) all features but the important ones. The expectation here is that if the important features are to suffice for model debugging, they should at the very least be enough to let the model match its original performance.

Feature importance in dense decision layers: Performance of the decision layer when it is restricted to using the "important" features vs the rest of the features.

As we can see, this is not the case for typical deep networks. Indeed, for all but one task, the top-k features (k is 10 for the vision tasks and 5 for the language tasks) are far from sufficient to recover model performance. Further, there seems to be a great deal of redundancy in the standard decision layer: the model can perform quite well even without using any of the seemingly important features. Clearly, inspecting only the highest-weighted features is not sufficient from a debugging standpoint.

#### Our solution: retraining with sparsity

To make inspecting the decision layer more tractable for humans and also deal with feature redundancy, we replace that layer entirely. Specifically, rather than finding better heuristics for identifying salient features within the standard (dense) decision layer, we retrain it (on top of the existing feature representations) to be sparse.

To this end, we leverage a classic primitive from statistics: sparse linear classifiers. Concretely, we use the elastic net approach to train regularized linear decision layers on top of the fixed (pre-trained) feature representation.

The elastic net is a popular approach for fitting linear models in statistics that combines the benefits of both L1 and L2 regularization. Elastic net solvers yield not one but a series of sparse linear models, each with a different sparsity/accuracy trade-off, depending on the strength of regularization. We can then let our application-specific accuracy-vs-sparsity needs guide our choice of a specific sparse decision layer from this series.
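As a sketch of what fitting such a regularization path looks like, here is a small example using scikit-learn's elastic-net-penalized logistic regression on stand-in features (the actual work uses a custom large-scale solver; the dataset here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for pre-extracted deep feature embeddings and labels.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

results = {}
for C in [0.01, 0.1, 1.0]:  # smaller C = stronger regularization = sparser model
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, C=C, max_iter=5000)
    clf.fit(X, y)
    results[C] = (clf.score(X, y), np.count_nonzero(clf.coef_))
    print(f"C={C}: accuracy={results[C][0]:.2f}, nonzero weights={results[C][1]}")
```

Sweeping the regularization strength traces out the sparsity/accuracy series described above, from which an application picks its operating point.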

However, when employing this approach to modern deep networks, we hit an obstacle—existing solvers for training regularized linear models simply cannot scale to the number of datapoints and input features that we would typically have in deep learning. To overcome this problem, we develop a custom, efficient solver for fitting regularized generalized linear models at scale. This solver leverages recent advances in variance reduced gradient methods and combines them with path-algorithms from statistics to get fast and stable convergence at ImageNet scales. We won’t go into much detail here, but we point the curious reader to our paper and our standalone PyTorch package (which might be of independent interest) for more information.

To summarize—the elastic net gives us a sparse decision layer that, in turn, enables us to debug the resulting network by applying the existing feature interpretation methods to a now-significantly-reduced number of features (i.e., only the ones used by the sparse decision layer).

## What do we gain from sparsity?

Now that we have our methodology in place, we can apply it to standard ML tasks and measure the impact of enforcing sparsity of the final decision layer. Specifically, we discuss the results of applying it to ResNet-50 classifiers trained on ImageNet and Places-10 (a 10-class subset of Places365), as well as BERT models trained on the Stanford Sentiment Treebank and Wikipedia toxic comment classification tasks.

### Sparsity at the last layer is (almost) free

Needless to say, the usefulness of our method hinges on the degree of sparsity in the decision layer that we can achieve without losing much accuracy. So how far can we turn the sparsity dial? The answer turns out to be: a lot! For instance, the final decision layer of an ImageNet classifier with 2048 features can be reduced by two orders of magnitude, i.e., to use only 20 features per class, at the cost of only 2% test accuracy loss.
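Measuring this kind of sparsity is straightforward once the decision layer's weight matrix is in hand. A sketch with a synthetic weight matrix, where a hard threshold stands in for the elastic-net fit:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 2048))              # dense layer: 1000 classes x 2048 features
W_sparse = np.where(np.abs(W) > 2.33, W, 0.0)  # keeps roughly 2% of the weights

features_per_class = np.count_nonzero(W_sparse, axis=1)
density = np.count_nonzero(W_sparse) / W_sparse.size
print(f"mean features per class: {features_per_class.mean():.1f}")
print(f"fraction of nonzero weights: {density:.3f}")
```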

In the following demonstration, one can move the slider to the right to increase the density of the final decision layer of a standard ImageNet classifier. And, indeed, with only 2% of weights being non-zero, the model can already essentially match the performance (74%) of a fully dense layer.


Sparsity-accuracy trade-off: A visualization of the sparsity of an ImageNet decision layer and its corresponding accuracy as a function of the regularization strength. Move the slider all the way to the right to get the fully dense layer (no regularization, 74% accuracy), or all the way to the left to get the fully sparse layer (maximum regularization, 5% accuracy).

### A closer look at sparse decision layers

Our key motivation for constructing sparse decision layers was that it enables us to manually examine the (reduced set of) features that a network uses. As we saw above, our modified decision layers rely on substantially fewer features per class—which already significantly aids their inspection by a human. But what if we go one step further and look only at the “important” features of our sparse decision layer, as we tried to do with the dense decision layer earlier?

Feature importance in sparse and dense decision layers: Performance of the decision layer when it is restricted to using the "important" features vs the rest of the features. Try toggling between the two to see the effects of sparsity.

As we can see, for models with a sparse decision layer, the top 5-10 important features are necessary and almost sufficient for capturing the model’s performance. That is, (i) accuracy drops to near chance levels (1/number of classes) if the model does not leverage these features and (ii) using these features alone, the model can nearly recover its original performance. This indicates that the sparsity constraint not only reduces the number of features used by the model, but also makes it easier to rank features based on their importance.

### Sparse decision layers: an interactive demonstration

In the following interactive demonstration, you can explore a subset of the decision layer of a (robust) ResNet-50 on ImageNet with either a sparse or dense decision layer:

An interactive demo of the sparse decision layer: Select a dense or sparse model and a corresponding ImageNet class to visualize the features and weights for the corresponding decision layer. The opacity of each feature corresponds to the magnitude of its weight in the decision layer, and you can click on a feature to see a larger version of it.

Finally, one should note that the features used by sparse decision layers seem somewhat more human-aligned than the ones used by the standard (dense) decision layers. This observation, coupled with our previous ablation studies, indicates that sparse decision layers could offer a path towards more debuggable deep networks. But is this really the case? In our next post, we will evaluate whether models obtained via our methodology are indeed easier for humans to understand, and whether they truly aid the diagnosis of unexpected model behaviors.

### Debuggable Deep Networks: Usage and Evaluation (Part 2)

This is the second part of the overview of our recent work on training more debuggable deep networks. In our previous post, we outlined our toolkit for constructing such networks, which involved training (very) sparse linear classifiers on (pre-trained) deep feature embeddings and viewing the network’s decision process as a linear combination of these features. In this post, we will delve deeper into evaluating to what extent these networks are amenable to debugging. Specifically, we want to get a sense of whether humans are able to intuit their behavior and pinpoint their failure modes.

## Do our sparse decision layers truly aid human understanding?

Although our toolkit enables us to greatly simplify the network’s decision layer (by reducing the number of its weights and thus the features it relies on), it is not immediately obvious whether this will make debugging such models significantly easier. To properly examine this, we need to factor humans into the equation. One way to do that is to leverage the notion of simulatability used in the context of ML interpretability. According to this notion, an interpretability method is “good” if it can enable a human to reproduce the model’s decision. In our setup, this translates into evaluating how sparsity of the final decision layer influences humans’ ability to predict the model’s classification decision (irrespective of whether this decision is “correct” or not).

#### The “simulatability” study

One approach to assess simulatability would be to ask annotators to guess what the model will label an input (e.g., an image) as, given an interpretation corresponding to that input. However, for non-expert annotators, this might be challenging due to the large number of (often fine-grained) classes that a typical dataset contains. Additionally, human cognitive biases may also muddle the evaluation—e.g., it may be hard for annotators to decouple “what they think the model should label the input as” from “what the interpretation suggests the model actually does” (and we are interested in the latter).

To alleviate these difficulties, we resort instead to the following task setup (conducted using an ImageNet-trained ResNet):

1. We pick a target class at random, and show annotators visualizations of five randomly-selected features used by the sparse decision layer to detect objects of this class, along with their relative weights.
2. We present the annotators with three images from the validation set with varying (but still non-trivial) probabilities of being classified by the model as the target class. (Note that each of these images can potentially belong to different, non-target classes.)
3. Finally, we ask annotators to pick which one among these three images they believe to best match the target class.

As mentioned in part 1, feature visualizations for standard vision models are often hard to parse, so we use adversarially-trained models for this study.

Here is a sample task (click to enlarge):

Overall, our intention is to gauge whether humans can intuit which image (out of three) is most prototypical for the target class according to the model. Note that we do not show annotators any information about the target class—such as its name or description—other than illustrations of some of the features that the model uses to identify it. As discussed previously, this is intentional: we want annotators to select the image that visually matches the features used by the model, instead of using their prior knowledge to associate images with the target label itself. For instance, if the annotators know that the target label was “car”, they might end up choosing the image that most closely resembles their idea of a car—independent of (or even in contradiction to) how the model actually detects cars. In fact, the “most activating image” in our setup may not even belong to the target class.

Now, how well do humans do on this task?

We find that (MTurk) annotators are pretty good at simulating the behavior of our modified networks—they correctly guess the top activating image (out of three) 63% of the time! In contrast, they essentially fail, with only a 35% success rate (i.e., near chance), when this task is performed using models with standard, i.e., dense, decision layers. This suggests that even with a very simple setup—showing non-experts some of the features the sparse decision layer uses to recognize a target class—humans are actually able to emulate the behavior of our modified networks.

## Debuggability via Sparsity

So far, we identified a number of advantages of employing sparse decision layers, such as having fewer components to analyze, selected features being more influential, and better human simulatibility. But what unintended model behaviors can we (semi-automatically) identify by just probing such decision layers?

### Uncovering (spurious) correlations and biases

Let’s start with trying to uncover model biases. After all, it is by now evident that deep networks rely on undesired correlations extracted from the training data (e.g. backgrounds, identity-related keywords). But can we pinpoint this behavior without resorting to a targeted examination?

#### Bias in toxic comment classification

In 2019, Jigsaw hosted a competition on Kaggle around creating toxic comment detection systems. This effort was prompted by the fact that the systems available at the time were found to have incorrectly learned to associate the names of frequently attacked identities (e.g., nationality, religion or sexual identity) with toxicity, and so the goal of the competition was to construct a “debiased” system. Can we understand to what extent this effort succeeded?

To answer this question, we leverage our methodology: we fit a sparse decision layer to the debiased model released by the contest organizers, and then inspect the deep features it uses. An example result is shown below:

Interpreting the deep features of a debiased sentiment classifier: A word cloud visualization (with some of the words redacted) for a deep feature of the debiased model (with a sparse decision layer). The negative activation of this feature turns out to be influenced by Christianity-related words.

Looking at this visualization, we can observe that the debiased model no longer positively associates identity terms with toxicity (refer to our paper for a similar visualization corresponding to the original biased model). This seems to be a success—after all, the goal of the competition was to correct the over-sensitivity of prior models to identity-group keywords. However, upon closer inspection, one will note that the model has actually learned a strong, negative association between these keywords and comment toxicity. For example, one can take a word such as “christianity” and append it to toxic sentences to trick the model into thinking that these are non-toxic 74% of the time. One can try it by selecting words to add to the sentence below:

Sentence: Jeez Ed, you seem like a ******* ****** *********
Bias detection in language models: Using sparse decision layers, we find that the debiased model is still oversensitive to keywords corresponding to frequently attacked identity groups, although the sensitivity now runs in the opposite direction from the previous model.

So, what we see is that rather than being debiased, newer toxic comment detection systems remain disproportionately sensitive to identity terms—it is just the nature of the sensitivity that changed.
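The sensitivity measurement itself is simple to express in code. In this sketch, `classify_toxicity` is a hypothetical stand-in for the real model; it hard-codes the learned negative association described above, purely to illustrate how the flip rate is computed:

```python
def classify_toxicity(sentence: str) -> bool:
    """Toy stand-in for a toxicity classifier with the bias described above."""
    toxic = "jerk" in sentence.lower()
    # Mimics the learned negative association with an identity keyword.
    return toxic and "christianity" not in sentence.lower()

toxic_sentences = ["You are a complete jerk.", "What a jerk move, Ed."]
flips = sum(
    classify_toxicity(s) and not classify_toxicity(s + " christianity")
    for s in toxic_sentences
)
print(f"flip rate: {flips / len(toxic_sentences):.0%}")  # prints "flip rate: 100%"
```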

#### Spurious correlations in ImageNet

In the NLP setting, we can directly measure correlations between the model’s predictions and input data patterns by toggling specific words or phrases in the input corpus. However, it is not obvious how to replicate such analysis in the vision setting. After all, we don’t have automated tools to decompose images into a set of human understandable patterns akin to words or phrases (e.g., “dog ears” or “wheels”).

We thus leverage instead a human-in-the-loop approach that uses (sparse) decision layer inspection as a primitive. Specifically, we enlist annotators on MTurk to identify and describe data patterns that activate individual features that the sparse decision layer uses (for a given class). This in turn allows us to pinpoint the correlations the model has learned between the input data and that class.

Concretely, to identify the data patterns that are positively correlated with a particular (deep) feature, we present to MTurk annotators a set of images that strongly activate it. The expectation here is that if a set of images activate a given feature, these images should share a common input pattern and the annotators will be able to identify it.

Note that we show annotators images from multiple (two) classes that strongly activate a single feature. This is because images from any single class may have many input patterns in common—only some of which actually activate a specific feature.

We then ask annotators: (a) whether they see a common pattern in the images, and, if so, (b) to provide a free text description of that pattern. If the annotators identify a common input pattern, we also ask them if the identified pattern belongs to the class object (“spurious”) or its surroundings (“non-spurious”) for each of the two classes.

In general, we recognize that precisely defining spurious correlations might be challenging and context-dependent. Our definition of spurious correlations was chosen to be objective and easy for annotators to assess.

Here is an example of the annotation task (click to expand):

Here are a few examples of (spurious) correlations identified by annotators:

Detecting input-class correlations in vision models: Select a class pair on the top to see the annotator-provided description for the deep feature that is activated by images of these classes (left). The free-text description provided by the annotators is visualized as a wordcloud (right), along with their selections for whether this input pattern is part of the class object ("non-spurious") or its surroundings ("spurious").

Note that one can, in principle, use the same human-in-the-loop methodology to identify input correlations extracted by standard deep networks (with dense decision layers). However, since these models rely on a large number of (deep) features to detect objects of every class, this process can quickly become intractable (see our paper for details).

The above studies demonstrate that for typical vision and NLP tasks, sparsity in the decision layer makes it easier to look deeper into the model and understand what patterns it has extracted from its training corpus.

### Creating effective counterfactuals

Our second approach for characterizing model failure modes uses the lens of counterfactuals. We specifically focus on counterfactuals that are (minor) variations of given inputs that prompt the model to make a different prediction. Counterfactuals can be very helpful from a debugging standpoint—they can confirm that specific input patterns are not just correlated with the model’s predictions but actually causally influence them. Additionally, such counterfactuals can be used to provide recourse to users—e.g., to let them realize what attributes (e.g., credit rating) they should change to get the desired outcome (e.g., granting a loan). We will now discuss how to leverage the correlations identified in the previous section to construct counterfactuals for models with sparse decision layers.

#### Language counterfactuals in sentiment classification

In sentiment classification, the task is to label a given sentence as having either positive or negative sentiment. Here, we consider counterfactuals via word substitution, effectively asking “what word could I have used instead to change the sentiment predicted by the model for a given sentence?”

To this end, we consider the words that are positively and negatively correlated with features used by the sparse decision layer as candidates for word substitution. For example, the word “astounding” activates a feature that a BERT model uses to detect positive sentiment, whereas the word “condescending” is negatively correlated with the activation of this feature. By substituting such a positively-correlated word with its negatively-correlated counterpart, we can effectively “flip” the corresponding feature. A demonstration of this process is shown below:

Positive activation (word cloud): impressed, brings, marvel, exhilarated, completes, hilariously, successfully, yes
Negative activation (word cloud)
Sentence: The acting, costumes, music, cinematography and sound are all [astounding] given the production's austere locales.
Language counterfactuals: A wordcloud visualization for a deep feature (used by the sparse decision layer) that positively activates for the sentence shown above. By replacing the specific word that activated this feature (in this case "astounding"), with any word that deactivates it (select on the right), we can effectively flip the sentiment predicted by the model. In this way, we can construct counterfactuals for our modified deep networks via one-word substitutions.

It turns out that these one-word modifications are indeed already quite successful (i.e., they cause a change in the model’s prediction 73% of the time). The obtained sentence pairs—which can be viewed as counterfactuals for one another—allow us to gain insight into the data patterns that cause the model to predict a certain outcome. Finally, for standard models, finding effective counterfactuals that flip the model’s prediction turns out to be harder—the one-word modifications described above change the model’s decision in only 52% of cases.
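A sketch of the substitution procedure, with a toy stand-in for the sentiment model and hypothetical word lists (in practice, the positively/negatively correlated words come from the LIME-based feature interpretations):

```python
POSITIVE_WORDS = {"astounding", "impressive"}  # words that activate the feature
NEGATIVE_WORDS = {"condescending", "tedious"}  # words that deactivate it

def predict_sentiment(sentence: str) -> str:
    """Toy stand-in for a BERT sentiment classifier."""
    words = set(sentence.lower().replace(".", "").split())
    return "positive" if words & POSITIVE_WORDS else "negative"

original = "The acting and music are astounding."
counterfactual = original.replace("astounding", "condescending")
print(predict_sentiment(original))        # prints "positive"
print(predict_sentiment(counterfactual))  # prints "negative"
```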

#### ImageNet counterfactuals

For ImageNet-trained models, we can directly use the patterns previously identified by the annotators to generate counterfactual images that change its prediction. To this end, we manually modify images to add or subtract these patterns and observe the effect of this operation on the model’s decision.

For example, annotators identify a background feature “chainlink fence” to be spuriously correlated with “ballplayers”. Using this information, we can then take images of people playing basketball or tennis (correctly labeled as “basketball” or “racket” by the model) and manually insert a “chainlink fence” into the background, which successfully changes the model’s prediction to “ballplayer”.

Counterfactuals for ImageNet classifiers: By adding specific spurious patterns to correctly-classified images (top), we can fool the model into predicting the desired class (bottom).

Thus, the counterfactuals that our methodology produced indeed allow us to identify data patterns that are causally linked to the model’s decision making process.

### Identifying reasons for misclassification

Finally, we turn our attention to debugging model errors. After all, when our models are wrong, it would be helpful to know why this was the case.

In the ImageNet setting, we find that many (over 30%) of the misclassifications of the sparse-decision-layer models can be attributed to a single “problematic” feature. That is, manually removing this feature results in a correct prediction. One can thus view the feature interpretation for this problematic feature as a justification for the model’s error.
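The "manual removal" amounts to zeroing one coordinate of the feature vector before the linear decision layer. A deterministic toy example (synthetic weights and activations, not the actual model):

```python
import numpy as np

W = np.zeros((3, 5))  # decision layer: 3 classes x 5 features
W[0, 0] = 1.0         # class 0 relies on feature 0 (the "problematic" feature)
W[1, 1] = 1.0         # class 1 is the true class
W[2, 2] = 1.0

features = np.array([2.0, 1.5, 0.2, 0.0, 0.0])  # feature 0 fires spuriously
pred_before = int(np.argmax(W @ features))      # -> 0 (the wrong class)

features[0] = 0.0                               # zero out the problematic feature
pred_after = int(np.argmax(W @ features))       # -> 1 (the true class)
print(pred_before, pred_after)
```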

A closer look at ImageNet misclassifications: Examples of erroneously classified ImageNet images (top), along with the feature visualization for the "problematic feature" from the incorrect class (bottom). We find that manually setting the activation of this problematic feature to zero is sufficient to fix the model's mistake in each of these cases.

Ideally, given such a justification, we would like humans to be able to identify the part of the image corresponding to the problematic feature that caused the model to make a mistake. How can we evaluate whether this is the case? Namely, can we obtain an unbiased assessment of whether the data patterns that activate the problematic feature are noticeably present in the misclassified image?

To answer this question, we conduct a study on MTurk wherein we present annotators with an image, along with feature visualizations for: (i) the most activated feature from the true class and (ii) the problematic feature that is activated for the erroneous class. We do not explicitly tell the annotators what classes these features correspond to. We then ask annotators to select the patterns (feature visualizations) that match the image, and to determine which pattern is a better match if they select both.

Here is an example of a task we present to the annotators (click to expand):

As a control, we also rerun this experiment while replacing the problematic feature with a randomly-chosen feature. This serves as a baseline to compare annotator selection for the features from the true/incorrect classes.

It turns out that not only do annotators frequently (70% of the time) identify the top feature from the wrongly-predicted class as present in the image, but also that this feature is actually a better match than the top feature for the ground truth class (60% of the time). In contrast, annotators select the control (randomly-chosen) deep feature to be a match for the image only 16% of the time. One can explore some examples here:

Misclassifications validated by MTurk annotators: Select an image on the top to see its true and predicted labels, along with the most highly activated deep feature (of those used by the sparse decision layer) for both these classes. In all cases, annotators select the top feature from the (incorrect) predicted class to be present in the image, and to be a better match than the top feature from the true class.

This experiment validates, without the confirmation bias that class labels could introduce, that humans can identify the data patterns that trigger the error-inducing problematic deep features. Once these patterns have been identified, one can examine them to better understand the root cause of model errors (e.g., issues with the training data).

## Conclusions

Over the course of this two-part series, we have shown that a natural approach of fitting sparse linear models over deep feature representations can already be surprisingly effective in creating more debuggable deep networks. In particular, we saw that models constructed using this methodology are more concise and amenable to human understanding—making it easier to detect and analyze unintended behaviors such as biases and misclassification. Going forward, this methodology of modifying the network architecture to make it inherently easier to probe can offer an attractive alternative to the existing paradigm of purely post-hoc debugging. Additionally, our analysis introduces a suite of human-in-the-loop techniques for model debugging at scale and thus can help guide further work in this field.

### Towards a Model for LSH

Authors: Li Wang
Abstract: As data volumes continue to grow, clustering and outlier detection algorithms are becoming increasingly time-consuming. Classical index structures for neighbor search are no longer sustainable due to the "curse of dimensionality". Instead, approximate index structures offer a good opportunity to significantly accelerate neighbor search for clustering and outlier detection while keeping the error rate in the algorithms' results as low as possible. Locality-sensitive hashing (LSH) is one such structure. We indicate directions for modeling the properties of LSH.
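For readers unfamiliar with LSH, here is a minimal sketch of one classic locality-sensitive family, random-hyperplane hashing for cosine similarity (an illustration only; the modeling questions in the paper are more general):

```python
import numpy as np

def simhash_table(vectors, num_bits, seed=0):
    """Bucket vectors by the sign pattern of random projections.

    Each random hyperplane contributes one bit; vectors at a small angle
    agree on each bit with probability 1 - angle/pi, so near neighbors
    tend to collide in the same bucket.
    """
    rng = np.random.default_rng(seed)
    dim = vectors.shape[1]
    planes = rng.standard_normal((num_bits, dim))  # one hyperplane per bit
    bits = (vectors @ planes.T) > 0  # sign of each projection
    table = {}
    for i, row in enumerate(bits):
        table.setdefault(tuple(row), []).append(i)
    return table
```

A query is then compared only against the points sharing its bucket key, which is where the speedup over exhaustive search comes from.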

### Current Algorithms for Detecting Subgraphs of Bounded Treewidth are Probably Optimal

Authors: Karl Bringmann, Jasper Slusallek
Abstract: The Subgraph Isomorphism problem is of considerable importance in computer science. We examine the problem when the pattern graph H is of bounded treewidth, as occurs in a variety of applications. This problem has a well-known algorithm via color-coding that runs in time $O(n^{tw(H)+1})$ [Alon, Yuster, Zwick'95], where $n$ is the number of vertices of the host graph $G$. While there are pattern graphs known for which Subgraph Isomorphism can be solved in an improved running time of $O(n^{tw(H)+1-\varepsilon})$ or even faster (e.g. for $k$-cliques), it is not known whether such improvements are possible for all patterns. The only known lower bound rules out time $n^{o(tw(H) / \log(tw(H)))}$ for any class of patterns of unbounded treewidth assuming the Exponential Time Hypothesis [Marx'07].

In this paper, we demonstrate the existence of maximally hard pattern graphs $H$ that require time $n^{tw(H)+1-o(1)}$. Specifically, under the Strong Exponential Time Hypothesis (SETH), a standard assumption from fine-grained complexity theory, we prove the following asymptotic statement for large treewidth $t$: For any $\varepsilon > 0$ there exists $t \ge 3$ and a pattern graph $H$ of treewidth $t$ such that Subgraph Isomorphism on pattern $H$ has no algorithm running in time $O(n^{t+1-\varepsilon})$.

Under the more recent 3-uniform Hyperclique hypothesis, we even obtain tight lower bounds for each specific treewidth $t \ge 3$: For any $t \ge 3$ there exists a pattern graph $H$ of treewidth $t$ such that for any $\varepsilon>0$ Subgraph Isomorphism on pattern $H$ has no algorithm running in time $O(n^{t+1-\varepsilon})$.

In addition to these main results, we explore (1) colored and uncolored problem variants (and why they are equivalent for most cases), (2) Subgraph Isomorphism for $tw < 3$, (3) Subgraph Isomorphism parameterized by pathwidth, and (4) a weighted problem variant.

### Constrained Consensus Sequence Algorithm for DNA Archiving

Authors: Dominique Lavenier
Abstract: The paper describes an algorithm to compute a consensus sequence from a set of DNA sequences of approximatively identical length generated by 3rd sequencing generation technologies. Its purpose targets DNA storage and is guided by specific features that cannot be exhibited from biological data such as the exact length of the consensus sequences, the precise location of known patterns, the kmer composition, etc.

### Compact Euler Tours of Trees with Small Maximum Degree

Authors: Travis Gagie
Abstract: We show how an Euler tour for a tree on $n$ vertices with maximum degree $d$ can be stored in $2 n + o (n)$ bits such that queries take $O (\log n)$ time and updates take $O (d \log^{1 + \epsilon} n)$ time.
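To make the object concrete: an Euler tour of a rooted $n$-node tree can be written down as a balanced-parentheses bit string of length exactly $2n$. The sketch below produces that encoding (illustrative only; it is not the paper's query/update structure):

```python
def euler_tour_bits(children, root=0):
    """Balanced-parentheses encoding of a tree's Euler tour.

    Writes 1 when the tour first enters a node and 0 when it leaves,
    giving exactly 2n bits for an n-node tree.
    `children` maps each node to the list of its children
    (leaves may be omitted from the dict).
    """
    bits = []
    stack = [(root, False)]  # (node, are we leaving it?)
    while stack:
        node, leaving = stack.pop()
        if leaving:
            bits.append(0)
            continue
        bits.append(1)
        stack.append((node, True))
        # Push children in reverse so the tour visits them left to right.
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return bits
```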

### Parameterized Algorithms for Diverse Multistage Problems

Authors: Leon Kellerhals, Malte Renken, Philipp Zschoche
Abstract: The world is rarely static -- many problems need not only be solved once but repeatedly, under changing conditions. This setting is addressed by the "multistage" view on computational problems. We study the "diverse multistage" variant, where consecutive solutions of large variety are preferable to similar ones, e.g. for reasons of fairness or wear minimization. While some aspects of this model have been tackled before, we introduce a framework allowing us to prove that a number of diverse multistage problems are fixed-parameter tractable by diversity, namely Perfect Matching, s-t Path, Matroid Independent Set, and Plurality Voting. This is achieved by first solving special, colored variants of these problems, which might also be of independent interest.

### Improved LCAs for constructing spanners

Authors: Rubi Arviv, Reut Levi
Abstract: In this paper we study the problem of constructing spanners in a local manner, specifically in the Local Computation Model proposed by Rubinfeld et al. (ICS 2011).

We provide an LCA for constructing $(2r-1)$-spanners with $\widetilde{O}(n^{1+1/r})$ edges and probe complexity of $\widetilde{O}(n^{1-1/r})$ for $r \in \{2,3\}$, where $n$ denotes the number of vertices in the input graph. Up to polylogarithmic factors, in both cases the stretch factor is optimal (for the respective number of edges). In addition, our probe complexity for $r=2$, i.e., for constructing a $3$-spanner, is optimal up to polylogarithmic factors. Our result improves over the probe complexity of Parter et al. (ITCS 2019), which is $\widetilde{O}(n^{1-1/2r})$ for $r \in \{2,3\}$.

For general $k\geq 1$, we provide an LCA for constructing $O(k^2)$-spanners with $\tilde{O}(n^{1+1/k})$ edges on expectation whose probe complexity is $O(n^{2/3}\Delta^2)$. This improves over the probe complexity of Parter et al. that is $O(n^{2/3}\Delta^4)$.

### Testing Triangle Freeness in the General Model in Graphs with Arboricity $O(\sqrt{n})$

Authors: Reut Levi
Abstract: We study the problem of testing triangle freeness in the general graph model. This problem was first studied in the general graph model by Alon et al. (SIAM J. Discret. Math. 2008), who provided both lower bounds and upper bounds that depend on the number of vertices and the average degree of the graph. Their bounds are tight only when $d_{\rm max} = O(\bar{d})$ and $\bar{d} \leq \sqrt{n}$, or when $\bar{d} = \Theta(1)$, where $d_{\rm max}$ denotes the maximum degree and $\bar{d}$ denotes the average degree of the graph. In this paper we provide bounds that depend on the arboricity of the graph and the average degree. As in Alon et al., the parameters of our tester are the number of vertices, $n$, the number of edges, $m$, and the proximity parameter $\epsilon$ (the arboricity of the graph is not a parameter of the algorithm). The query complexity of our tester is $\tilde{O}(\Gamma/\bar{d} + \Gamma)\cdot poly(1/\epsilon)$ on expectation, where $\Gamma$ denotes the arboricity of the input graph (we use $\tilde{O}(\cdot)$ to suppress $O(\log \log n)$ factors). We show that for graphs with arboricity $O(\sqrt{n})$ this upper bound is tight in the following sense. For any $\Gamma \in [s]$, where $s= \Theta(\sqrt{n})$, there exists a family of graphs with arboricity $\Gamma$ and average degree $\bar{d}$ such that $\Omega(\Gamma/\bar{d} + \Gamma)$ queries are required for testing triangle freeness on this family of graphs. Moreover, this lower bound holds for any such $\Gamma$ and for a large range of feasible average degrees.
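For contrast with the sublinear tester, the exact property being tested is easy to state in code. A naive exhaustive check, assuming an adjacency-set representation (a tester only approximates this predicate with far fewer queries):

```python
def has_triangle(adj):
    """Exhaustive triangle detection on a graph given as
    {vertex: set of neighbors}. Enumerates each candidate
    triangle u < v < w once via neighborhood intersection.
    """
    for u in adj:
        for v in adj[u]:
            if v > u:
                for w in adj[u] & adj[v]:  # common neighbors of u and v
                    if w > v:
                        return True
    return False
```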

### Tree Edit Distance with Variables. Measuring the Similarity between Mathematical Formulas

Authors: Tatsuya Akutsu, Tomoya Mori, Naotoshi Nakamura, Satoshi Kozawa, Yuhei Ueno, Thomas N. Sato
Abstract: In this article, we propose tree edit distance with variables, which is an extension of the tree edit distance to handle trees with variables and has a potential application to measuring the similarity between mathematical formulas, especially, those appearing in mathematical models of biological systems. We analyze the computational complexities of several variants of this new model. In particular, we show that the problem is NP-complete for ordered trees. We also show for unordered trees that the problem of deciding whether or not the distance is 0 is graph isomorphism complete but can be solved in polynomial time if the maximum outdegree of input trees is bounded by a constant. This distance model is then extended for measuring the difference/similarity between two systems of differential equations, for which results of preliminary computational experiments using biological models are provided.

### A 3-approximation list scheduling algorithm for a single-machine scheduling problem with a non-renewable resource and total weighted completion time criterion

Authors: Susumu Hashimoto, Shinji Mizuno
Abstract: This paper studies a single-machine scheduling problem with a non-renewable resource (NR-SSP) and total weighted completion time criterion. The non-renewable resource is consumed when the machine starts processing a job. We consider the case where each job's weight in the objective function is proportional to its resource consumption amount. The problem is known to be NP-hard in this case. We propose a 3-approximation list scheduling algorithm for this problem. Besides, we show that the approximation ratio 3 is tight for the algorithm.

### Near Neighbor Search via Efficient Average Distortion Embeddings

Authors: Deepanshu Kush, Aleksandar Nikolov, Haohua Tang
Abstract: A recent series of papers by Andoni, Naor, Nikolov, Razenshteyn, and Waingarten (STOC 2018, FOCS 2018) has given approximate near neighbour search (NNS) data structures for a wide class of distance metrics, including all norms. In particular, these data structures achieve approximation on the order of $p$ for $\ell_p^d$ norms with space complexity nearly linear in the dataset size $n$ and polynomial in the dimension $d$, and query time sub-linear in $n$ and polynomial in $d$. The main shortcoming is the exponential in $d$ pre-processing time required for their construction.

In this paper, we describe a more direct framework for constructing NNS data structures for general norms. More specifically, we show via an algorithmic reduction that an efficient NNS data structure for a given metric is implied by an efficient average distortion embedding of it into $\ell_1$ or into Euclidean space. In particular, the resulting data structures require only polynomial pre-processing time, as long as the embedding can be computed in polynomial time. As a concrete instantiation of this framework, we give an NNS data structure for $\ell_p$ with efficient pre-processing that matches the approximation factor, space and query complexity of the aforementioned data structure of Andoni et al. On the way, we resolve a question of Naor (Analysis and Geometry in Metric Spaces, 2014) and provide an explicit, efficiently computable embedding of $\ell_p$, for $p \ge 2$, into $\ell_2$ with (quadratic) average distortion on the order of $p$. We expect our approach to pave the way for constructing efficient NNS data structures for all norms.

### ppsim: A software package for efficiently simulating and visualizing population protocols

Authors: David Doty, Eric Severson
Abstract: We introduce ppsim, a software package for efficiently simulating population protocols, a widely-studied subclass of chemical reaction networks (CRNs) in which all reactions have two reactants and two products. Each step in the dynamics involves picking a uniform random pair from a population of $n$ molecules to collide and have a (potentially null) reaction. In a recent breakthrough, Berenbrink, Hammer, Kaaser, Meyer, Penschuck, and Tran [ESA 2020] discovered a population protocol simulation algorithm quadratically faster than the naive algorithm, simulating $\Theta(\sqrt{n})$ reactions in *constant* time, while preserving the *exact* stochastic dynamics.

ppsim implements this algorithm, with a tightly optimized Cython implementation that can exactly simulate hundreds of billions of reactions in seconds. It dynamically switches to the CRN Gillespie algorithm for efficiency gains when the number of applicable reactions in a configuration becomes small. As a Python library, ppsim also includes many useful tools for data visualization in Jupyter notebooks, allowing robust visualization of time dynamics such as histogram plots at time snapshots and averaging repeated trials.

Finally, we give a framework that takes any CRN with only bimolecular (2 reactant, 2 product) or unimolecular (1 reactant, 1 product) reactions, with arbitrary rate constants, and compiles it into a continuous-time population protocol. This lets ppsim exactly sample from the chemical master equation (unlike approximate heuristics such as tau-leaping or LNA), while achieving asymptotic gains in running time. In linked Jupyter notebooks, we demonstrate the efficacy of the tool on some protocols of interest in molecular programming, including the approximate majority CRN and CRN models of DNA strand displacement reactions.
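The naive dynamics that the fast simulation algorithm accelerates, one uniformly random interaction per step, is easy to sketch for the approximate majority protocol mentioned above (an illustration only, not ppsim's implementation or API):

```python
import random

def approximate_majority(counts, steps, seed=0):
    """Naive population-protocol simulation: each step picks a uniform
    random ordered pair of agents and applies the approximate-majority
    rules A+B -> U+U, A+U -> A+A, B+U -> B+B (all other pairs are null).

    `counts` maps state -> number of agents. This is the one-interaction-
    per-step baseline; the batched algorithm simulates the same exact
    stochastic dynamics far faster.
    """
    rng = random.Random(seed)
    pop = [s for s, c in counts.items() for _ in range(c)]
    n = len(pop)
    rules = {("A", "B"): ("U", "U"), ("B", "A"): ("U", "U"),
             ("A", "U"): ("A", "A"), ("U", "A"): ("A", "A"),
             ("B", "U"): ("B", "B"), ("U", "B"): ("B", "B")}
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)  # two distinct random agents
        out = rules.get((pop[i], pop[j]))
        if out is not None:  # otherwise: null reaction
            pop[i], pop[j] = out
    return {s: pop.count(s) for s in set(pop)}
```

Starting from an initial majority of A agents, the population converges to all-A with high probability after O(n log n) interactions.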

### Ultrafast Distributed Coloring of High Degree Graphs

Authors: Magnús M. Halldórsson, Alexandre Nolin, Tigran Tonoyan
Abstract: We give a new randomized distributed algorithm for the $\Delta+1$-list coloring problem. The algorithm and its analysis dramatically simplify the previous best result known of Chang, Li, and Pettie [SICOMP 2020]. This allows for numerous refinements, and in particular, we can color all $n$-node graphs of maximum degree $\Delta \ge \log^{2+\Omega(1)} n$ in $O(\log^* n)$ rounds. The algorithm works in the CONGEST model, i.e., it uses only $O(\log n)$ bits per message for communication. On low-degree graphs, the algorithm shatters the graph into components of size $\operatorname{poly}(\log n)$ in $O(\log^* \Delta)$ rounds, showing that the randomized complexity of $\Delta+1$-list coloring in CONGEST depends inherently on the deterministic complexity of related coloring problems.

### Parameterized Complexity of Deletion to Scattered Graph Classes

Authors: Ashwin Jacob, Jari J. H. de Kroon, Diptapriyo Majumdar, Venkatesh Raman
Abstract: Graph-modification problems, where we add/delete a small number of vertices/edges to make the given graph belong to a simpler graph class, are well-studied optimization problems in all algorithmic paradigms, including classical, approximation and parameterized complexity. Specifically, graph-deletion problems, where one needs to delete at most $k$ vertices to place the graph in a given non-trivial hereditary (closed under induced subgraphs) graph class, capture several well-studied problems, including {\sc Vertex Cover}, {\sc Feedback Vertex Set}, {\sc Odd Cycle Transversal}, {\sc Cluster Vertex Deletion}, and {\sc Perfect Deletion}. Investigation into these problems in parameterized complexity has given rise to powerful tools and techniques. While a precise characterization of the graph classes for which the problem is {\it fixed-parameter tractable} (FPT) is elusive, it has long been known that if the graph class is characterized by a {\it finite} set of forbidden graphs, then the problem is FPT.

In this paper, we initiate a study of a natural variation of the problem of deletion to {\it scattered graph classes} where we need to delete at most $k$ vertices so that in the resulting graph, each connected component belongs to one of a constant number of graph classes. A simple hitting set based approach is no longer feasible even if each of the graph classes is characterized by finite forbidden sets.

As our main result, we show that this problem is fixed-parameter tractable (FPT) whenever the deletion problem corresponding to each individual class is known to be FPT and membership in each class is expressible in CMSO logic.

When each graph class has a finite forbidden set, we give a faster FPT algorithm using the well-known techniques of iterative compression and important separators.

### Partitioning H-Free Graphs of Bounded Diameter

Authors: Christoph Brause, Petr Golovach, Barnaby Martin, Daniël Paulusma, Siani Smith
Abstract: A natural way of increasing our understanding of NP-complete graph problems is to restrict the input to a special graph class. Classes of $H$-free graphs, that is, graphs that do not contain some graph $H$ as an induced subgraph, have proven to be an ideal testbed for such a complexity study. However, if the forbidden graph $H$ contains a cycle or claw, then these problems often stay NP-complete. A recent complexity study on the $k$-Colouring problem shows that we may still obtain tractable results if we also bound the diameter of the $H$-free input graph. We continue this line of research by initiating a complexity study on the impact of bounding the diameter for a variety of classical vertex partitioning problems restricted to $H$-free graphs. We prove that bounding the diameter does not help for Independent Set, but leads to new tractable cases for problems closely related to 3-Colouring. That is, we show that Near-Bipartiteness, Independent Feedback Vertex Set, Independent Odd Cycle Transversal, Acyclic 3-Colouring and Star 3-Colouring are all polynomial-time solvable for chair-free graphs of bounded diameter. To obtain these results we exploit a new structural property of 3-colourable chair-free graphs.

### Knapsack and Subset Sum with Small Items

Authors: Adam Polak, Lars Rohwedder, Karol Węgrzycki
Abstract: Knapsack and Subset Sum are fundamental NP-hard problems in combinatorial optimization. Recently there has been a growing interest in understanding the best possible pseudopolynomial running times for these problems with respect to various parameters.

In this paper we focus on the maximum item size $s$ and the maximum item value $v$. We give algorithms that run in time $O(n + s^3)$ and $O(n + v^3)$ for the Knapsack problem, and in time $\tilde{O}(n + s^{5/3})$ for the Subset Sum problem.

Our algorithms work for the more general problem variants with multiplicities, where each input item comes with a (binary encoded) multiplicity, which succinctly describes how many times the item appears in the instance. In these variants $n$ denotes the (possibly much smaller) number of distinct items.

Our results follow from combining and optimizing several diverse lines of research, notably proximity arguments for integer programming due to Eisenbrand and Weismantel (TALG 2019), fast structured $(\min,+)$-convolution by Kellerer and Pferschy (J. Comb. Optim. 2004), and additive combinatorics methods originating from Galil and Margalit (SICOMP 1991).
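For context, the classical pseudopolynomial baseline for Subset Sum, which results like these refine in terms of the maximum item size $s$, is a short bitset dynamic program:

```python
def subset_sums(items, target):
    """Textbook pseudopolynomial Subset Sum: returns an integer whose
    bit t is set iff some subset of `items` sums to t (0 <= t <= target).
    Runs in O(n * target / wordsize) word operations, the baseline that
    size-parameterized algorithms improve on when items are small.
    """
    reachable = 1  # bit 0 set: the empty subset sums to 0
    mask = (1 << (target + 1)) - 1  # keep only sums up to target
    for x in items:
        reachable |= (reachable << x) & mask
    return reachable

def is_subset_sum(items, target):
    return bool((subset_sums(items, target) >> target) & 1)
```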

### Fast and Error-Adaptive Influence Maximization based on Count-Distinct Sketches

Authors: Gokhan Gokturk, Kamer Kaya
Abstract: Influence maximization (IM) is the problem of finding a seed vertex set that maximizes the expected number of vertices influenced under a given diffusion model. Due to the NP-hardness of finding an optimal seed set, approximation algorithms are frequently used for IM. In this work, we describe a fast, error-adaptive approach that leverages Count-Distinct sketches and hash-based fused sampling. To estimate the number of influenced vertices throughout a diffusion, we use per-vertex Flajolet-Martin sketches where each sketch corresponds to a sampled subgraph. To efficiently simulate the diffusions, the reach-set cardinalities of a single vertex are stored in memory in a consecutive fashion. This allows the proposed algorithm to estimate the number of influenced vertices in a single step for all simulations at once. For a faster IM kernel, we rebuild the sketches in parallel only after observing estimation errors above a given threshold. Our experimental results show that the proposed algorithm yields high-quality seed sets while being up to 119x faster than a state-of-the-art approximation algorithm. In addition, it is up to 62x faster than a sketch-based approach while producing seed sets with 3%-12% better influence scores.
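A minimal single-machine illustration of the Flajolet-Martin count-distinct idea that such sketches build on (not the paper's per-vertex, fused-sampling implementation):

```python
import random

def fm_estimate(stream, num_sketches=64, seed=0):
    """Flajolet-Martin count-distinct estimate: hash each item under
    several salts, track the maximum number of trailing zero bits per
    sketch, and convert the averaged maximum back into a cardinality.
    """
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_sketches)]
    max_tz = [0] * num_sketches
    for item in stream:
        for k, salt in enumerate(salts):
            h = hash((salt, item)) & ((1 << 64) - 1)
            tz = (h & -h).bit_length() - 1 if h else 64  # trailing zeros
            max_tz[k] = max(max_tz[k], tz)
    mean_tz = sum(max_tz) / num_sketches
    return 2 ** mean_tz / 0.77351  # standard FM correction constant
```

The key property exploited for influence estimation is that duplicates never change the sketch, so union and repeated insertion are free.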

### Fast $n$-fold Boolean Convolution via Additive Combinatorics

Authors: Karl Bringmann, Vasileios Nakos
Abstract: We consider the problem of computing the Boolean convolution (with wraparound) of $n$~vectors of dimension $m$, or, equivalently, the problem of computing the sumset $A_1+A_2+\ldots+A_n$ for $A_1,\ldots,A_n \subseteq \mathbb{Z}_m$. Boolean convolution formalizes the frequent task of combining two subproblems, where the whole problem has a solution of size $k$ if for some $i$ the first subproblem has a solution of size~$i$ and the second subproblem has a solution of size $k-i$. Our problem formalizes a natural generalization, namely combining solutions of $n$ subproblems subject to a modular constraint. This simultaneously generalises Modular Subset Sum and Boolean Convolution (Sumset Computation). Although nearly optimal algorithms are known for special cases of this problem, not even tiny improvements are known for the general case.

We almost resolve the computational complexity of this problem, shaving essentially a factor of $n$ from the running time of previous algorithms. Specifically, we present a \emph{deterministic} algorithm running in \emph{almost} linear time with respect to the input plus output size $k$. We also present a \emph{Las Vegas} algorithm running in \emph{nearly} linear expected time with respect to the input plus output size $k$. Previously, no deterministic or randomized $o(nk)$ algorithm was known.

At the heart of our approach lies a careful usage of Kneser's theorem from Additive Combinatorics, and a new deterministic almost linear output-sensitive algorithm for non-negative sparse convolution. In total, our work builds a solid toolbox that could be of independent interest.
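The naive baseline being improved here simply folds the sets together one at a time, paying up to an extra factor of $n$ overall. A sketch of that baseline:

```python
def sumset_mod(sets, m):
    """Naive n-fold sumset A_1 + A_2 + ... + A_n over Z_m:
    combine one set at a time into an accumulator. This is the
    O(nk)-type baseline that the almost-linear algorithms avoid.
    """
    acc = {0}  # the empty sum
    for a in sets:
        acc = {(x + y) % m for x in acc for y in a}
    return acc
```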

### Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space

Authors: Dmitry Kosolobov, Nikita Sivukhin
Abstract: The notions of synchronizing and partitioning sets are recently introduced variants of locally consistent parsings with great potential in problem-solving. In this paper we propose a deterministic algorithm that constructs, for a given read-only string of length $n$ over the alphabet $\{0,1,\ldots,n^{\mathcal{O}(1)}\}$, a version of $\tau$-partitioning set with size $\mathcal{O}(b)$ and $\tau = \frac{n}{b}$ using $\mathcal{O}(b)$ space and $\mathcal{O}(\frac{1}{\epsilon}n)$ time provided $b \ge n^\epsilon$, for $\epsilon > 0$. As a corollary, for $b \ge n^\epsilon$ and constant $\epsilon > 0$, we obtain linear construction algorithms with $\mathcal{O}(b)$ space on top of the string for two major small-space indexes: a sparse suffix tree, which is a compacted trie built on $b$ chosen suffixes of the string, and a longest common extension (LCE) index, which occupies $\mathcal{O}(b)$ space and allows us to compute the longest common prefix for any pair of substrings in $\mathcal{O}(n/b)$ time. For both, the $\mathcal{O}(b)$ construction space is asymptotically optimal since the tree itself takes $\mathcal{O}(b)$ space and any LCE index with $\mathcal{O}(n/b)$ query time must occupy $\Omega(b)$ space by a known trade-off (at least for $b \ge \Omega(n / \log n)$). In the case of arbitrary $b \ge \Omega(\log^2 n)$, we present construction algorithms for the partitioning set, sparse suffix tree, and LCE index with $\mathcal{O}(n\log_b n)$ running time and $\mathcal{O}(b)$ space, thus also improving the state of the art.

### Separations for Estimating Large Frequency Moments on Data Streams

Authors: David P. Woodruff, Samson Zhou
Abstract: We study the classical problem of moment estimation of an underlying vector whose $n$ coordinates are implicitly defined through a series of updates in a data stream. We show that if the updates to the vector arrive in the random-order insertion-only model, then there exist space efficient algorithms with improved dependencies on the approximation parameter $\varepsilon$. In particular, for any real $p > 2$, we first obtain an algorithm for $F_p$ moment estimation using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{4/p}}\cdot n^{1-2/p}\right)$ bits of memory. Our techniques also give algorithms for $F_p$ moment estimation with $p>2$ on arbitrary order insertion-only and turnstile streams, using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{4/p}}\cdot n^{1-2/p}\right)$ bits of space and two passes, which is the first optimal multi-pass $F_p$ estimation algorithm up to $\log n$ factors. Finally, we give an improved lower bound of $\Omega\left(\frac{1}{\varepsilon^2}\cdot n^{1-2/p}\right)$ for one-pass insertion-only streams. Our results separate the complexity of this problem both between random and non-random orders, as well as one-pass and multi-pass streams.

### Parameterized Complexity of Feature Selection for Categorical Data Clustering

Authors: Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach, Kirill Simonov
Abstract: We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $\ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $\Sigma$), we want to select $m-\ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($\ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|\Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|\Sigma|)\cdot m^{g(k,|\Sigma|)}\cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|\Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.

### Learning stochastic decision trees

Authors: Guy Blanc, Jane Lange, Li-Yang Tan
Abstract: We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $\eta$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2\eta + \varepsilon$ of the Bayes optimal. An additive $2\eta$ is the information-theoretic minimum.

Previously no non-trivial algorithm with a guarantee of $O(\eta) + \varepsilon$ was known, even for weaker noise models. Our algorithm is furthermore proper, returning a hypothesis that is itself a decision tree; previously no such algorithm was known even in the noiseless setting.

### STOC Test of time award

A reminder: the deadline to submit nominations for the STOC Test of Time Award is May 24. You can nominate papers for the

• 10 year award – STOC 2007-2011
• 20 year award – STOC 1997-2001
• 30 year award – STOC 1987-1991

The award website ( https://sigact.org/prizes/stoc_tot.html ) helpfully contains links to the papers published in all these conferences.

Please nominate the papers you think have most influenced our field!

by Boaz Barak at May 11, 2021 06:28 PM UTC

### Super Solutions of the Model RB

Authors: Guangyan Zhou, Wei Xu
Abstract: The concept of a super solution is a special type of generalized solution with a certain degree of robustness and stability. In this paper we consider the $(1,1)$-super solutions of the model RB. Using the first moment method, we establish a "threshold" such that, as the constraint density crosses this value, the expected number of $(1,1)$-super solutions goes from $0$ to infinity.

### Quantum Proofs of Proximity

Authors: Marcel Dall'Agnol, Tom Gur, Subhayan Roy Moulik, Justin Thaler
Abstract: We initiate the systematic study of QMA algorithms in the setting of property testing, to which we refer as QMA proofs of proximity (QMAPs). These are quantum query algorithms that receive explicit access to a sublinear-size untrusted proof and are required to accept inputs having a property $\Pi$ and reject inputs that are $\varepsilon$-far from $\Pi$, while only probing a minuscule portion of their input. Our algorithmic results include a general-purpose theorem that enables quantum speedups for testing an expressive class of properties, namely, those that are succinctly decomposable. Furthermore, we show quantum speedups for properties that lie outside of this family, such as graph bipartiteness. We also investigate the complexity landscape of this model, showing that QMAPs can be exponentially stronger than both classical proofs of proximity and quantum testers. To this end, we extend the methodology of Blais, Brody and Matulef (Computational Complexity, 2012) to prove quantum property testing lower bounds via reductions from communication complexity, thereby resolving a problem raised by Montanaro and de Wolf (Theory of Computing, 2016).

### On the Complexity of Verification of Time-Sensitive Distributed Systems: Technical Report

Authors: Max Kanovich, Tajana Ban Kirigin, Vivek Nigam, Andre Scedrov, Carolyn Talcott
Abstract: Time-Sensitive Distributed Systems (TSDS), such as applications using autonomous drones, achieve goals under possible environment interference (e.g., winds). Goals are often specified using explicit time constraints, and, moreover, goals must be satisfied by the system perpetually. For example, drones carrying out the surveillance of some area must always have recent pictures, i.e., at most M time units old, of some strategic locations. This paper proposes a Multiset Rewriting language with explicit time for specifying and analyzing TSDSes. We introduce new properties, such as realizability (there exists a good trace), survivability (where, in addition, all admissible traces are good), recoverability (all compliant traces do not reach points-of-no-return), and reliability (system can always continue functioning using a good trace). A good trace is an infinite trace in which goals are perpetually satisfied. We propose a class of systems called Progressing Timed Systems (PTS), where intuitively only a finite number of actions can be carried out in a bounded time period. We prove that for this class of systems the problems of realizability, recoverability, reliability, and survivability are PSPACE-complete. Furthermore, if we impose a bound on time (as in bounded model-checking), we show that for PTS, realizability becomes NP-complete, while survivability and reliability problems are in the $\Delta_2^p$ class of the polynomial hierarchy.

### Two PhD/Postdoc Positions in Algorithms and Complexity Theory at Goethe-University of Frankfurt (apply by June 15, 2021)

from CCI: jobs

The research group conducts research on fundamental questions of computation and information, is driven by curiosity, and provides a friendly, open-minded, and positive social environment. Potential research topics include algebraic graph algorithms, the theory of machine learning on graphs, circuit complexity, pseudorandomness, and fine-grained and parameterized complexity. The position includes some teaching.

Website: https://tcs.uni-frankfurt.de/positions/
Email: tcs-applications@dlist.uni-frankfurt.de

by shacharlovett at May 10, 2021 03:29 PM UTC

from Scott Aaronson

1. For those who read my reply to Richard Borcherds on “teapot supremacy”: seeking better data, I ordered a dozen terra cotta flowerpots, and smashed eight of them on my driveway with my 4-year-old son, dropping each one from approximately 2 meters. For each flowerpot, we counted how many pieces it broke into, seeking insight about the distribution over that number. Unfortunately, it still proved nearly impossible to get good data, for a reason commenters had already warned me about: namely, there were typically 5-10 largeish shards, followed by a “long tail” of smaller and smaller shards (eventually, just terra cotta specks), with no obvious place to draw the line and stop counting. Nevertheless, when I attempted to count only the shards that were “fingernail-length or larger,” here’s what I got: 1 pot with 9 shards, 1 with 11 shards, 2 with 13 shards, 2 with 15 shards, 1 with 17 shards, 1 with 19 shards. This is a beautiful (too beautiful?) symmetric distribution centered around a mean of 14 shards, although it’s anyone’s guess whether it approximates a Gaussian or something else. I have no idea why every pot broke into an odd number of shards, unless of course it was a 1-in-256-level fluke, or some cognitive bias that made me preferentially stop counting the shards at odd numbers.
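As a quick sanity check of the arithmetic above (our own snippet, not from the post), the mean shard count and the odds of the all-odd fluke can be recomputed:

```python
from statistics import mean

# Shard counts from the eight smashed flowerpots: one pot with 9 shards,
# one with 11, two with 13, two with 15, one with 17, one with 19.
shards = [9, 11, 13, 13, 15, 15, 17, 19]

print(mean(shards))  # 14

# If each pot were equally likely to break into an odd or even number of
# shards, the chance that all eight counts come out odd:
print(1 / 2**8)  # 0.00390625, i.e. 1 in 256
```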

### Congratulations, Dr. Matias!

from David Eppstein

Pedro Ascensao Ferreira Matias, one of the students working with Mike Goodrich in the UC Irvine Center for Algorithms and Theory of Computation, passed his Ph.D. defense today.

Pedro is Portuguese, and came to UCI after a bachelor’s degree from the University of Coimbra in Portugal and a master’s degree from Chalmers University of Technology in Sweden.

The general topic of Pedro’s research is “exact learning”, the inference of structured information from queries or other smaller pieces of data. I’ve written here before about my work with Matias on nearest-neighbor chains and on tracking paths in planar graphs, the problem of placing sensors on a small subset of vertices so that, by detecting the order in which a path reaches each sensor, you can uniquely determine the whole path. His dissertation combines the tracking paths work with a second paper on tracking paths (“How to Catch Marathon Cheaters: New Approximation Algorithms for Tracking Paths”, arXiv:2104.12337, to appear at WADS 2021), and a paper on reconstructing periodic and near-periodic strings from sublinear numbers of queries (“Adaptive Exact Learning in a Mixed-Up World: Dealing with Periodicity, Errors and Jumbled-Index Queries in String Reconstruction”, arXiv:2007.08787, in SPIRE 2020). He also has recent papers on reconstructing trees in SPAA 2020 and ESA 2020.

After finishing his doctorate, Pedro’s next position will be working for Facebook.

Congratulations, Pedro!

by David Eppstein at May 07, 2021 02:50 PM UTC

### Faculty Full Professor at University of Bamberg, Germany (apply by June 18, 2021)

from CCI: jobs

The Faculty of Information Systems and Applied Computer Sciences invites applications for the position of Full Professor (W3 level) in Algorithms and Complexity Theory with a focus on algorithms and complexity theory for distributed and concurrent software systems, as well as for the acquisition, processing and visualisation of data in networked systems.

Website: https://www.uni-bamberg.de/abt-personal/stellenausschreibung/professuren/
Email: michael.mendler@uni-bamberg.de

by shacharlovett at May 07, 2021 02:12 PM UTC

### Fair Clustering with Probabilistic Group Membership

from TOC for Fairness

This post briefly summarizes a NeurIPS-20 paper, Probabilistic Fair Clustering, which I coauthored with Brian Brubach, Leonidas Tsepenekas, and John P. Dickerson.

Clustering is possibly the most fundamental problem of unsupervised learning. As in many other paradigms of machine learning, there has been a focus on fair variants of clustering. Perhaps the definition which has received the most attention is the group fairness definition of [1]. The notion is based on disparate impact and simply states that each cluster should contain points belonging to the different demographic groups with “appropriate” proportions. A natural interpretation of appropriate would imply that each demographic group appears in close to population-level proportions in each cluster. More specifically, if we were to endow each point with a color $h \in \mathcal{H}$ to designate its group membership and we were to consider the $k$-means clustering objective, then this notion of fair clustering amounts to the following constrained optimization problem:

\begin{aligned} & \text{min} \sum_{i \in \lbrack k\rbrack } \sum_{j \in C_i} d(j,\mu_i)^2 \\ & \text{s.t. }\forall i \in \lbrack k\rbrack , \forall h \in \mathcal{H}: l_h |C_i| \leq |C^h_i| \leq u_h |C_i| \end{aligned}

Here, $l_h$ and $u_h$ are the lower and upper pre-set proportionality bounds for color $h$, $C_i$ denotes the points in cluster $i$, and $C^h_i$ denotes the subset of those points with color $h$. See Figure 1 for a comparison between the outputs of color-agnostic and fair clustering.
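To make the constraint concrete, here is a minimal sketch (the function name and toy data are ours, not from the paper) that checks whether a given clustering satisfies the proportionality bounds:

```python
from collections import Counter

def fairness_violations(labels, colors, l, u):
    """Check the group-fairness constraint l_h*|C_i| <= |C_i^h| <= u_h*|C_i|
    for every cluster i and color h; return the violated (i, h) pairs."""
    clusters = {}
    for lab, col in zip(labels, colors):
        clusters.setdefault(lab, []).append(col)
    violations = []
    for i, members in clusters.items():
        size = len(members)
        counts = Counter(members)
        for h in set(colors):
            if not (l[h] * size <= counts[h] <= u[h] * size):
                violations.append((i, h))
    return violations

# Toy example: two clusters, two colors; the population is 50/50, so we ask
# each cluster to contain between 30% and 70% of each color.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
colors = ['r', 'r', 'b', 'b', 'r', 'b', 'b', 'r']
l = {'r': 0.3, 'b': 0.3}
u = {'r': 0.7, 'b': 0.7}
print(fairness_violations(labels, colors, l, u))  # → [] (both clusters are 50/50)
```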

If one were to use clustering for market segmentation and targeted advertisement, then the above definition of fair clustering would roughly ensure that each demographic group receives the same exposure to every type of ad. Similarly if we were to cluster news articles and let the source of each article indicate its membership then we could ensure that each cluster has a good mixture of news from different sources [2].

Significant progress has been made on this notion of fair clustering, starting from the two-color case with only under-representation bounds and extending to the multi-color case with both under- and over-representation bounds [3, 4, 5]. Scalable methods for larger datasets have also been proposed [6, 7].

As in the majority of methods in group-fair supervised learning, it is assumed that the group membership of each point in the dataset is known. This setting conflicts with a common situation in practice where group memberships are either imperfectly known or completely unknown [8,9,10,11,12]. We take the first step in generalizing fair clustering to this setting; specifically, we assume that while we do not know the exact group membership of each point, we instead have a probability distribution over the group memberships. A natural generalization of the previous optimization problem would be the following:

\begin{aligned} & \text{min} \sum_{i \in \lbrack k\rbrack } \sum_{j \in C_i} d(j,\mu_i)^2 \\ & \text{s.t. }\forall i \in \lbrack k\rbrack , \forall h \in \mathcal{H}: l_h |C_i| \leq \mathbb{E}|C^h_i| \leq u_h |C_i| \end{aligned}

Here, the proportionality constraints are simply required to hold in expectation instead of deterministically. Clearly, this constraint reduces to the original one when the group memberships are completely known. Figure 2 helps visualize what the input to probabilistic fair clustering looks like and the output we expect.
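Under the probabilistic model, only the expected counts change; a sketch of the corresponding check (again with illustrative names, not the paper's code) replaces the exact count $|C^h_i|$ with the sum of the points' probabilities for color $h$:

```python
def probabilistic_violations(labels, color_probs, l, u):
    """Probabilistic variant: color_probs[j] maps each color h to P(point j
    has color h). The constraint now bounds the expected count
    E|C_i^h| = sum of p_j(h) over points j in cluster i."""
    clusters = {}
    for lab, probs in zip(labels, color_probs):
        clusters.setdefault(lab, []).append(probs)
    violations = []
    for i, members in clusters.items():
        size = len(members)
        for h in members[0]:
            expected = sum(p[h] for p in members)
            if not (l[h] * size <= expected <= u[h] * size):
                violations.append((i, h))
    return violations

# Four points with uncertain group membership, all in one cluster.
labels = [0, 0, 0, 0]
color_probs = [{'r': 0.9, 'b': 0.1}, {'r': 0.2, 'b': 0.8},
               {'r': 0.5, 'b': 0.5}, {'r': 0.4, 'b': 0.6}]
l = {'r': 0.3, 'b': 0.3}
u = {'r': 0.7, 'b': 0.7}
# Expected counts are 2.0 of each color out of 4 points, so no violations.
print(probabilistic_violations(labels, color_probs, l, u))  # → []
```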

Despite the innocuous modification to the constraint, the problem becomes significantly more difficult. In our paper, we consider the center-based clustering objectives of $k$-center, $k$-median, and $k$-means and produce solutions with approximation-ratio guarantees in two cases:

• Two-Color Case: Even the two-color case is not easy to handle; the key difficulty lies in the rounding method. We give a rounding method that maintains the fairness constraint with a worst-case additive violation of 1, matching the deterministic fair clustering case.
• Multi-Color Case with Large Enough Clusters: At a high level, if the clusters are sufficiently large, then through a Chernoff bound we can show that independently sampling a color for each point yields a deterministic fair clustering instance, which we can solve using deterministic fair clustering algorithms. This essentially forms a reduction from the probabilistic instance to the deterministic one.
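The reduction in the second case can be sketched as follows (a toy illustration under the stated large-cluster assumption, not the paper's implementation): sample each point's color independently from its own distribution, then treat the realized colors as a deterministic fair clustering instance. Concentration ensures the realized proportions stay close to the expected ones.

```python
import random

def sample_deterministic_instance(color_probs, seed=0):
    """Reduce a probabilistic instance to a deterministic one by sampling
    each point's color independently from its own distribution; for large
    clusters, Chernoff bounds say the realized color counts concentrate
    around their expectations."""
    rng = random.Random(seed)
    realized = []
    for probs in color_probs:
        colors = list(probs)
        weights = [probs[h] for h in colors]
        realized.append(rng.choices(colors, weights=weights, k=1)[0])
    return realized

# With many points sharing the same distribution, the realized fraction of
# each color is close to its expectation.
color_probs = [{'r': 0.6, 'b': 0.4}] * 10_000
realized = sample_deterministic_instance(color_probs)
frac_r = realized.count('r') / len(realized)
print(round(frac_r, 2))  # close to 0.6
```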

While our solutions perform well empirically, we are left with a collection of open problems. For example, guaranteeing that the color proportions are maintained in expectation is not the strongest constraint one could hope for, since when the colors are realized a cluster could consist entirely of one color. A preferable constraint would instead bound the probability of obtaining an “unfair” clustering. Moreover, a setting that assumes access to each point's probability distribution over all colors could still be assuming too much. A more reasonable setting could instead take a robust-optimization-based approach, where each point's distribution is only known to belong to an uncertainty set. This effectively allows our probabilistic knowledge to be imperfect as well, as could be the case if, for example, a machine learning model were predicting group membership with a systematic bias against a particular subset of colors. Lastly, being able to handle the multi-color case in an assumption-free manner would also be interesting.

References:

1. Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, 2017.
2. Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
3. Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In the International Workshop on Approximation and Online Algorithms, 2019.
4. Ioana O. Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R. Schmidt, Melanie Schmidt. On the cost of essentially fair clusterings, In the International Conference on Approximation Algorithms for Combinatorial Optimization Problems 2019.
5. Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, 2019.
6. Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In the International Conference on Machine Learning, 2019.
7. Lingxiao Huang, Shaofeng Jiang, and Nisheeth Vishnoi. Coresets for clustering with fairness constraints. In Advances in Neural Information Processing Systems, 2019.
8. Pranjal Awasthi, Matthäus Kleindessner, and Jamie Morgenstern. Equalized odds postprocessing under imperfect group information. In the International Conference on Artificial Intelligence and Statistics, 2020.
9. Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. In Advances in Neural Information Processing Systems, 2020.
10. David Pujol, Ryan McKenna, Satya Kuppam, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. Fair decision making using privacy-protected data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2020.
11. Hussein Mozannar, Mesrob Ohannessian, and Nathan Srebro. Fair learning with private demographic data. In the International Conference on Machine Learning, 2020.
12. Nathan Kallus, Xiaojie Mao, and Angela Zhou. Assessing algorithmic fairness with unobserved protected class using data combination. Management Science, 2021.

by seyed2357 at May 07, 2021 02:09 PM UTC

### Ph.D. Student at IDSIA, USI-SUPSI, Lugano, Switzerland (apply by June 30, 2021)

from CCI: jobs

IDSIA opens two 4-year Ph.D. positions, starting in November 2021, in the area of algorithms and complexity, with a focus on approximation algorithms.
The gross yearly salary is around 50K CHF. Candidates should hold a Master's degree in Computer Science or a related area.
Interested candidates should email Prof. Fabrizio Grandoni a detailed CV and the contact details of 2-3 references.

Website: https://people.idsia.ch//~grandoni/
Email: fabrizio@idsia.ch

by shacharlovett at May 07, 2021 09:47 AM UTC

### Alef Corner: ICM2022

from Gil Kalai

Alef’s new piece for ICM 2022 will surely cheer you up!

by Gil Kalai at May 06, 2021 07:45 PM UTC

### ALT Highlights – A Report on the First ALT Mentoring Workshop

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series will be disseminated as guest posts on different blogs in machine learning and theoretical computer science. This initiative is organized by the Learning Theory Alliance, and overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

This is the fifth post in the series, coverage of the first ALT Mentoring Workshop organized by the Learning Theory Alliance, written by Keziah Naggita and Sutanu Gayen.

## 1 Introduction

The Learning Theory Alliance (Let-All) is an online initiative aimed at developing a supportive learning theory community, founded by (1) Surbhi Goel, a postdoctoral researcher at Microsoft Research New York, (2) Nika Haghtalab, an assistant professor at UC Berkeley EECS, and (3) Ellen Vitercik, a Ph.D. student at CMU; and advised by Peter Bartlett, Avrim Blum, Stefanie Jegelka, Po-Ling Loh, and Jenn Wortman Vaughan. The goal of the alliance is to ensure healthy community growth by fostering inclusive community engagement and encouraging active contributions from researchers at all stages of their careers. Let-All's efforts towards realizing these goals include a series of ongoing and future activities, such as the first ALT mentoring workshop, coordinating the ALT Highlights blog series, and other upcoming community initiatives. This article reports on Let-All's first Mentoring Workshop, which was affiliated with the 32nd International Conference on Algorithmic Learning Theory.

The workshop had two main sessions to cater to the time zone differences of the participants. These sessions had three main components: an academic program, which included how-to-talks, Ask Me Anythings (AMAs), and presentation dissections; a technical program, which included research talks; and a social program, which included discussion tables and other activities.

The workshop participants included students, researchers, and industry professionals, all at different levels of familiarity with learning theory. Because of the ongoing COVID-19 pandemic, the workshop was virtual. It was held on the online platforms Zoom and Gather town, a virtual interactive environment that mimics an in-person workshop setting. For accessibility, the workshop organizers opened up the workshop free of cost to all registered participants.

## 2 Program Highlights

### 2.1 Academic Program

To kick off the workshop, one of the organizers began with a welcome lecture: Surbhi in session one and Nika in session two. They read out the code of conduct and whom to contact in case of issues, outlined the workshop's purpose, and shared demographic information about the attendees. They explained how participants could navigate the workshop-themed Gather town workspace and then ended the introduction with encouragement for participants to mingle.

The How-to-Talks sessions covered writing papers, giving talks, and networking. In Session 1, Pravesh Kothari went into great detail about the dos and don'ts of the abstract, overview, introduction, and appendix when advising participants on how best to structure research papers. He told attendees to always put effort into understanding their intended reader or talk audience, and to consider that audience's expertise and interests, since these largely determine their attention span and interest in the information presented. He strongly recommended attendees watch the Leadership Lab: The Craft of Writing Effectively by Larry McEnerney, Director of the University of Chicago Writing Program. In session 2 of the workshop, Rafael Frongillo, like Pravesh, discussed how to capture the intended audience when writing papers, reviews, and talks.

In the first networking session, Jacob Abernethy encouraged participants to seek out horizontal and vertical networking, for example, through collaborations, talks, and outreach. He said that currently, in academia, Ph.D. admissions, faculty hiring, and tenure appointments are heavily risk-averse; people therefore seek out candidates through their networks, making it crucial for students to network from early on in their careers. He gave great examples of how junior researchers can reach out and forge relationships with other researchers: for example, when you meet faculty or postdocs at events, ask to give a talk at their lab. Jacob also candidly talked about his earlier failures at MIT and how they shaped his journey. He talked about luck and how John Langford, who was at the Toyota Technological Institute at Chicago at the time, took a chance on him that forever changed his life. Jacob therefore advised academics to take chances on people, as this can change the course of the field.

Jamie Morgenstern discussed different networking methods in the second How-to-Talks session. She emphasized that it is important for junior researchers to attend conferences and network with others, to advertise their research through talks, and to reach out to faculty for collaboration. To introduce oneself and capture the listener's attention, Jamie advised preparing a two-minute introduction for conferences, four minutes for social settings, twelve minutes at a bar, and 25 minutes for a faculty interview. Senior grad students can help introduce junior ones during lunch and poster sessions. Finally, when emailing faculty about research, she said one should avoid discussing other people's work and instead stick to the recipient's work – “showing deep understanding and possibly open questions which might lead to collaboration.”

In both workshop sessions, there were two parallel talk dissections, in which senior faculty members gave both positive and constructive feedback on talks junior researchers presented. In the first session, Bobby Kleinberg discussed Emily Diana’s talk titled “Minimax and Lexicographically Fair Learning: Algorithms, Experiments, and Generalization”. He highlighted parts that were impressive, pointed out those that needed improvement, and gave general advice on structuring an audience-based presentation. When Bobby suggested including more diagrams than text, a few people suggested free tools including TikZ, matcha.io, PowerPoint, and draw.io. In parallel, Kamalika Chaudhuri dissected a talk on “Efficient, Noise-Tolerant, and Private Learning via Boosting” by Marco Carmosino. The two main takeaways of this talk dissection were balancing technical and nontechnical content (e.g., explaining ideas with fun pictures) and having one main, clear idea as the talk's takeaway.

In the second session, Mingda Qiao gave a talk titled: “Stronger Calibration Lower Bounds via Sidestepping” which Praneeth Netrapalli dissected. Praneeth remarked that theory folks often jump into the problem straight away without covering much background. In conferences, this might be fine due to time pressure and specific interests. However, in broader settings such as departmental seminars, he advised the speaker to allocate more time to introduce the problem lucidly and concisely. In parallel, Mary Wootters dissected a talk titled “List-Decodable Subspace Recovery: Dimension Independent Error in Polynomial Time” that Ainesh Bakshi presented.

In the first AMA session, moderated by Aaditya Ramdas, Lester Mackey refreshingly answered several of the attendees' well-curated questions about what makes strong collaborations, how to get into grad school, and whether he ever felt like quitting his Ph.D., among others. He encouraged students to take classes with professors they are interested in, as this makes it easier to ask for a mentorship opportunity. Lester talked about collaborations and imposter syndrome and encouraged attendees to look on the brighter side of things, remembering that we are all working towards one big goal: creating positive changes in the world. Therefore, if someone discovers a result before us, we should applaud them, collaborate if possible, and move on to new problems. He said he did not necessarily plan to do a Ph.D. but got into it towards the end of his undergraduate degree, after an internship made him fall in love with doing research.

In the evening, there was an AMA session with Shafi Goldwasser moderated by Nika. Shafi gave thoughtful and candid answers to attendees' captivating questions about research, life in academia, and collaborations, among other topics. Shafi told attendees that healthy competition, trust, and overlapping research interests are crucial for successful research in the early stages of a career. She also asserted that fundamental science is always impactful. She mentioned that the high points of her career were working on problems she was curious about: cryptography, pseudo-randomness, and zero-knowledge proofs. Finally, when asked what advice she wished she had received early in her career, Shafi interestingly replied: “having good colleagues, good friends at work, very important, most important – having a listening, promoting and supportive cohort of friends rather than an individualistic path as a scholar is priceless.”

### 2.2 Technical Program

Two research talks happened in the first session. First, Po-Ling Loh gave a talk titled “Mean estimation for entangled single-sample distributions.” Then Vatsal Sharan talked about “Sample Amplification: Increasing Dataset Size even when Learning is Impossible”.
Similarly in the second session, Nadav Cohen gave the first talk about tensor and matrix completion problems and the importance of understanding the theory behind deep learning from theoretical and practical perspectives. After, Zhiyi Huang gave a talk titled “Setting the Sample Complexity of Single-parameter Revenue Maximization.”

### 2.3 Social Program

In both sessions, during the social hours, Sumegha Garg, Suriya Gunasekar, and Thodoris Lykouris organized table topics to help attendees meet and interact with senior researchers and professors. The table topics included the following: starting on ML research, research agendas, ML+X: multidisciplinary research, advisor-advisee relationships, collaborators, communicating research and networking, beyond your institution: internships and research visits, planning after grad school: academia versus industry, grad school applications, work ethics, and open research discussion.

The table topics were chaired by: Jacob Abernethy, Shivani Agarwal, Eric Balkanski, Peter Bartlett, Avrim Blum, Sébastien Bubeck, Kamalika Chaudhuri, Nadav Cohen, Sumegha Garg, Surbhi Goel, Suriya Gunasekar, Nika Haghtalab, Daniel Hsu, Prateek Jain, Mike Jordan, Sham Kakade, Adam Kalai, Pravesh Kothari, Akshay Krishnamurthy, Jerry Li, Po-Ling Loh, Thodoris Lykouris, Yishay Mansour, Pasin Manurangsi, Vidya Muthukumar, Praneeth Netrapalli, Wen Sun, Bo Waggoner, Manolis Zampetakis, and Cyril Zhang.

Lastly, at the end of the two general research talks in sessions one and two of the workshop, attendees assembled on Gather town to close the workshop. The social event included 1:1 social interactions with other attendees, an attempt at forging relationships, and activities like dancing.

## 3 Attendance Statistics, Testimonials and Feedback

### 3.1 Participants Statistics

The ALT mentoring workshop welcomed talented academics, researchers, and professionals from a wide array of backgrounds. Of the 438 who registered to attend the workshop, 197 were new to the learning theory community, 37 had attended at least one ALT/COLT conference in the past, and 146 hadn't attended ALT/COLT but had attended other machine learning or theory conferences (STOC, NeurIPS, etc.), as shown in Figure 1.

The workshop participants came from different parts of the world and were of different genders, races, and seniority levels. We use Figures 2, 3b, 4a, and 4b to highlight this demographic information about the participants. Some participants chose session one, and others chose session two, a choice driven by their schedules and time zones. The attendance composition is shown in Figure 3a.

### 3.2 Testimonials and Feedback

In this section, we recount testimonials from participants whom we interviewed after the workshop. We also highlight some common themes in the feedback from participants.

In general, participants loved the content delivered in the sessions. They said it was informative, intuitive, and rare to find. Several participants loved interacting with peers and senior members and wished they had more time and activities for doing so. The How-to-Talks session (focused on networking skills and on structuring papers, talks, and reviews) was the most popular session among attendees. 76.7% of the survey respondents said the session helped them gain new technical skills or hone existing ones, see opportunities in academia and how to use them, and see barriers in academia and ways to overcome them. Figure 5 highlights attendees' ratings of the skills acquired from the workshop.

Below are accounts of the workshop experiences of the interviewed attendees.

“My highlight was the How-to-Talks since they provided a lot of more personal information/inputs that you cannot easily find online and which was very valuable. The event helped me to remember and reflect upon which qualities are crucial to becoming a good researcher. I even made a list in a place that I see every day to keep them in mind.” – Michael Aerni, MSc student at ETH Zurich.

“The workshop highlights for me were the How-to-talks, the AMA session with Lester, and the social tables. The How-to-talks were extremely valuable as they discussed topics such as structuring papers and networking in the community. These are subtle aspects that are not often explicitly talked about in the community. I, therefore, learned a lot from them. The AMA session was refreshingly honest and open. Finally, the social tables were also great as I got to meet and talk to some well-established senior community members like Sebastien Bubeck, Shivani Agarwal, and Akshay Krishnamurthy.” – Tanmay Gautam, a second-year Ph.D. student at UC Berkeley.

“It was awesome to have Lester Mackey answer my questions, indubitably. I learned about staying true to the research questions I genuinely believe in regardless of external opinions and rewards. As an NYU AI School organizer, I can appreciate how much effort went into organizing the workshop. The organizers did a stellar job! I think a version of the same event again would be perfect.” – Swapneel Mehta, a Data Science Ph.D. student at New York University.

“My highlights were getting to talk with senior members of the community. These opportunities rarely come by for someone who lives in a foreign country. For a timid person like myself, I am also thankful for the senior members for helping me (and other participants) breaking the ice and easing us into the conversations. Thanks to this event, I am now more confident in engaging with other researchers.”- Donlapark Ponnoprat, a Statistics lecturer at Chiang Mai University.

“My main highlights were Dr. Goldwasser’s session with Dr. Haghtalab and the socials. In the socials, I was able to ask professors and senior researchers for advice on varied topics based on the tables. Insights from Dr. Cohen’s lecture on tensor rank and implicit regularisation gave me several pointers to ideas in the literature that I was not aware of as a junior researcher from a slightly different AI specialty. These ideas might be beneficial for my research in the long term.

I send a sincere thank you to all the organizers. The mentorship workshop was a great event, and it models concrete actions, what it means to foster a welcoming community. It is clear how kind and dedicated folks are here as some researchers even stayed beyond midnight in their time zones to answer questions that attendees had. If an event like this happens again, I am most definitely signing up to come.” – Esra’a Saleh, a Masters in Computer Science student at the University of Alberta, affiliated with AMII and RLAI.

“My main takeaway from the event was an inside look at academia. As an undergrad, my only experience in academia has been the little experience I have with my advisory professors. While this is an invaluable experience, this event was nice as it was one of the very few that cater to students, including undergrads, with the intent of bringing them into the academia fold. Getting to know new people and talking to them was extremely interesting, especially during lockdown when connecting with others is a much more valuable commodity.” – Shrey Shah, a penultimate year undergraduate student at the University of Wisconsin-Madison.

Several participants enjoyed the workshop sessions and hoped that Let-All will hold more similarly themed workshops at conferences. Attendees also suggested ways to interact more with each other, including the following: ‘beginner-friendly open problems sessions where attendees can collaborate’ – Esra’a Saleh, ‘an icebreaker session at the beginning that encourages attendees to mingle’ – Michael Aerni, and ‘a poster session for participants to present their work’ – Shrey Shah.

## 4 Conclusion

The ALT mentorship workshop organized by the Learning Theory Alliance brought together many academics and researchers. It was, and we hope it continues to be, an opportunity for budding researchers to learn about research and meta-research, forge collaborations, and be inspired. Kudos to the organizers and the Alliance in general for dreaming up such a positive vision and then striving to make it a great success!

Thanks to Gautam Kamath, Margalit Glasgow, Surbhi Goel, Nika Haghtalab and Ellen Vitercik for helpful conversations and comments.

by Keziah at May 06, 2021 03:44 PM UTC