# Theory of Computing Blog Aggregator

### Senior Lecturer/Associate Professor at The University of Melbourne (apply by November 10, 2021)

from CCI: jobs

The Faculty of Engineering and Information Technology (FEIT) is seeking a dynamic academic with expertise in algorithms and their fairness, or related fields, to join the School of Computing and Information Systems (CIS).

Website: https://jobs.unimelb.edu.au/en/job/906606/senior-lecturerassociate-professor-in-algorithms-and-fairness
Email: awirth@unimelb.edu.au

by shacharlovett at October 27, 2021 05:58 AM UTC

### Multiple open-rank tenure track positions at University of Utah (apply by January 15, 2022)

from CCI: jobs

The School of Computing at the University of Utah seeks applications for tenure-track/tenured faculty in all areas of computer science/computing, and at all ranks, to fill approximately nine available positions.

Website: https://www.cs.utah.edu/faculty-hiring/#tenure

by shacharlovett at October 26, 2021 08:27 PM UTC

### Postdoc in combinatorial optimization at University of Copenhagen (apply by November 28, 2021)

from CCI: jobs

The CS department at the University of Copenhagen invites applications for postdoc positions in combinatorial optimization. The application deadline is November 28, 2021. See http://www.jakobnordstrom.se/openings/Postdoc-UCPH-211128.html for the full announcement with more information and instructions for how to apply. Informal enquiries are welcome and may be sent to jakob.nordstrom@cs.lth.se.

Website: http://www.jakobnordstrom.se/openings/Postdoc-UCPH-211128.html
Email: jakob.nordstrom@cs.lth.se

by shacharlovett at October 26, 2021 08:22 PM UTC

### Associate Professor at Oxford University (apply by January 28, 2022)

from CCI: jobs

As part of an expansion, the Department of Computer Science at Oxford is aiming to recruit 6 new faculty members, including one in Algorithms and Complexity Theory. The appointments will be at the level of Associate Professor or full Professor.

Email: Paul.Goldberg@cs.ox.ac.uk

by shacharlovett at October 26, 2021 10:57 AM UTC

### Refuting Tianrong Lin's arXiv:2110.05942 "Resolution of The Linear-Bounded Automata Question''

Authors: Thomas Preu
Abstract: In the preprint mentioned in the title Mr. Tianrong claims to prove $\textrm{NSPACE}[n]\neq\textrm{DSPACE}[n]$, resolving a longstanding open problem in automata theory called the LBA question. He claims to achieve this by showing more generally $\textrm{NSPACE}[S(n)]\neq\textrm{DSPACE}[S(n)]$ for suitable $S(n)$. We demonstrate that his proof is incomplete, even wrong, and his strategy cannot be repaired.

### Approximating LCS and Alignment Distance over Multiple Sequences

Authors: Debarati Das, Barna Saha
Abstract: We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS)), or minimizes the number of unaligned symbols (the alignment distance (AD)). Multiple sequence alignment is a well-studied problem in bioinformatics and is used to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or AD of $m$ sequences each of length $n$ requires $\Theta(n^m)$ time unless the Strong Exponential Time Hypothesis is false. In this paper, we provide several results to approximate LCS and AD of multiple sequences.

If the LCS of $m$ sequences each of length $n$ is $\lambda n$ for some $\lambda \in [0,1]$, then in $\tilde{O}_m(n^{\lfloor\frac{m}{2}\rfloor+1})$ time, we can return a common subsequence of length at least $\frac{\lambda^2 n}{2+\epsilon}$ for any arbitrary constant $\epsilon >0$.

It is possible to approximate the AD within a factor of two in time $\tilde{O}_m(n^{\lceil\frac{m}{2}\rceil})$. However, going below-2 approximation requires breaking the triangle inequality barrier which is a major challenge in this area. No such algorithm with a running time of $O(n^{\alpha m})$ for any $\alpha < 1$ is known. If the AD is $\theta n$, then we design an algorithm that approximates the AD within an approximation factor of $\left(2-\frac{3\theta}{16}+\epsilon\right)$ in $\tilde{O}_m(n^{\lfloor\frac{m}{2}\rfloor+2})$ time. Thus, if $\theta$ is a constant, we get a below-two approximation in $\tilde{O}_m(n^{\lfloor\frac{m}{2}\rfloor+2})$ time. Moreover, we show if just one out of $m$ sequences is $(p,B)$-pseudorandom then, we get a below-2 approximation in $\tilde{O}_m(nB^{m-1}+n^{\lfloor \frac{m}{2}\rfloor+3})$ time irrespective of $\theta$.

### A randomized quantum algorithm for statistical phase estimation

Authors: Kianna Wan, Mario Berta, Earl T. Campbell
Abstract: Phase estimation is a quantum algorithm for measuring the eigenvalues of a Hamiltonian. We propose and rigorously analyse a randomized phase estimation algorithm with two distinctive features. First, our algorithm has complexity independent of the number of terms L in the Hamiltonian. Second, unlike previous L-independent approaches, such as those based on qDRIFT, all sources of error in our algorithm can be suppressed by collecting more data samples, without increasing the circuit depth.

### Multiple open rank tenure track faculty positions at University of Illinois, Urbana-Champaign (UIUC) (apply by December 1, 2021)

from CCI: jobs

While we welcome applications from exceptional candidates in all areas, we particularly encourage applications from candidates working in quantum computing, cloud computing and data centers, interactive computing (HCI, HRI, wearable computing, AR/VR, graphics), the intersection of systems/architecture and artificial intelligence/machine learning, and the social impacts of computing.

Email: chekuri@illinois.edu

by shacharlovett at October 25, 2021 07:17 PM UTC

### How to deploy machine learning with differential privacy?

In many applications of machine learning, such as machine learning for medical diagnosis, we would like to have machine learning algorithms that do not memorize sensitive information about the training set, such as the specific medical histories of individual patients. Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data.

## Why do we need private machine learning algorithms?

Machine learning algorithms work by studying a lot of data and updating their parameters to encode the relationships in that data. Ideally, we would like the parameters of these machine learning models to encode general patterns (e.g., ‘‘patients who smoke are more likely to have heart disease’’) rather than facts about specific training examples (e.g., “Jane Smith has heart disease”). Unfortunately, machine learning algorithms do not learn to ignore these specifics by default. If we want to use machine learning to solve an important task, like making a cancer diagnosis model, then when we publish that machine learning model (for example, by making an open source cancer diagnosis model for doctors all over the world to use) we might also inadvertently reveal information about the training set. A malicious attacker might be able to inspect the published model’s predictions and learn private information about Jane Smith. For instance, the adversary could mount a membership inference attack to know whether or not Jane Smith contributed her data to the model’s training set [SSS17]. The adversary could also build on membership inference attacks to extract training data by repeatedly guessing possible training points until they result in a sufficiently strong membership signal from the model’s prediction [CTW20]. In many instances, the model itself may be represented by a few of the data samples (e.g., Support Vector Machine in its dual form).

A common misconception is that if a model generalizes (i.e., performs well on the test examples), then it preserves privacy. As mentioned earlier, this is far from being true. One of the main reasons being that generalization is an average case behavior of a model (over the distribution of data samples), whereas privacy must be provided for everyone, including outliers (which may deviate from our distributional assumptions).

Over the years, researchers have proposed various approaches towards protecting privacy in learning algorithms (k-anonymity [SS98], l-diversity [MKG07], m-invariance [XT07], t-closeness [LLV07] etc.). Unfortunately, [GKS08] all these approaches are vulnerable to what are called composition attacks, that use auxiliary information to violate the privacy protection. Famously, this strategy allowed researchers to de-anonymize part of a movie ratings dataset released to participants of the Netflix Prize when the individuals had also shared their movie ratings publicly on the Internet Movie Database (IMDb) [NS08]. If Jane Smith had assigned the same ratings to movies A, B and C in the Netflix Prize dataset and publicly on IMDb at similar times, then researchers could link data corresponding to Jane across both datasets. This would in turn give them the means to recover ratings that were included in the Netflix Prize but not on IMDb. This example shows how difficult it is to define and guarantee privacy because it is hard to estimate the scope of knowledge—about individuals—available to adversaries. While the dataset released by Netflix has since been taken down, it is difficult to ensure that all of its copies have been deleted. In recent years, data sample instance encoding based methods like InstaHide [HSL20], and NeuraCrypt [YEO21] have been demonstrated to be vulnerable to such composition attacks as well.

As a result, the research community has converged on differential privacy [DMNS06], which provides the following semantic guarantee, as opposed to ad-hoc approaches: An adversary learns almost the same information about an individual whether or not they are present in or absent from the training data set. In particular, it provides a condition on the algorithm, independent from who might be attacking it, or the specifics of the data set instantiation. Put another way, differential privacy is a framework for evaluating the guarantees provided by a system that was designed to protect privacy. Such systems can be applied directly to “raw” data which potentially still contains sensitive information, altogether removing the need for procedures that sanitize or anonymize data and are prone to the failures described previously. That said, minimizing data collection in the first place remains a good practice to limit other forms of privacy risk.

## Designing Private Machine Learning Algorithms via Differential Privacy

Differential privacy [DMNS06] is a semantic notion of privacy that addresses a lot of the limitations of previous approaches like k-anonymity. The basic idea is to randomize part of the mechanism’s behavior to provide privacy. In our case, the mechanism considered is a learning algorithm, but the differential privacy framework can be applied to study any algorithm.

The intuition for why we introduce randomness into the learning algorithm is that it obscures the contribution of an individual, but does not obscure important statistical patterns. Without randomness, we would be able to ask questions like: “What parameters does the learning algorithm choose when we train it on this specific dataset?” With randomness in the learning algorithm, we instead ask questions like: “What is the probability that the learning algorithm will choose parameters in this set of possible parameters, when we train it on this specific dataset?”

We use a version of differential privacy which requires (in our use case of machine learning) that the probability of learning any particular set of parameters stays roughly the same if we change a single data record in the training set. A data record can be a single training example from an individual, or the collection of all the training examples provided by an individual. The former is often referred to as example level/item level privacy, and the latter is referred to as user level differential privacy. While user level privacy provides stronger semantics, it may be harder to achieve. For a more thorough discussion about the taxonomy of these notions, see [DNPR10, JTT18, HR12, HR13]. In this document, for the ease of exposition of the technical results, we focus on the example level notion. This could mean to add a training example, remove a training example, or change the values within one training example. The intuition is that if a single patient (Jane Smith) does not affect the outcome of learning much, then that patient’s records cannot be memorized and her privacy is respected. In the rest of this post, how much a single record can affect the outcome of learning is called the sensitivity of an algorithm.

The guarantee of differential privacy is that the adversary is not able to distinguish the answers produced by the randomized algorithm based on the data of two of the three users from the answers returned by the same algorithm based on the data of all three users. We also refer to the degree of indistinguishability as the privacy loss. Smaller privacy loss corresponds to a stronger privacy guarantee.

It is often thought that privacy is a fundamental bottleneck in obtaining good prediction accuracy/generalization for machine learning algorithms. In fact, recent research has shown that in many instances it actually helps in designing algorithms with strong generalization ability. Some of the examples where DP has resulted in designing better learning algorithms are Online linear predictions [KV05] and online PCA [DTTZ13]. Notably, [DFH15] formally showed that generalization for any DP learning algorithm comes for free. More concretely, if a DP learning algorithm has good training accuracy, it is guaranteed to have good test accuracy. This is true because differential privacy itself acts as a very strong form of regularization.

One might argue that the generalization guarantee which a DP algorithm can achieve may be sub-par to that of its non-private baselines. For a large class of learning tasks, one can show that asymptotically DP does not introduce any further error beyond the inherent statistical error [SSTT21]. [ACG16,BFTT19] highlights that in the presence of enough data, a DP algorithm can get arbitrarily close to the inherent statistical error, even under strong privacy parameters.

## Private Empirical Risk Minimization

Before we go into the design of specific differentially private learning algorithms, we first formalize the problem setup, and standardize some notation. Consider a training data set $D={(x_1,y_1),…,(x_n,y_n)}$ drawn i.i.d. from some fixed (unknown) distribution $\Pi$, with the feature vector being $x_i$ and label/response being $y_i$. We define the training loss at any model $\theta$ as $L_{train} (\theta, D) = \frac{1}{n} \sum_{i=1}^{n} l(\theta; (x_i, y_i))$, and the corresponding test loss as $L_{test} (\theta) = E_{(x,y) \sim \Pi} l(\theta;(x,y))$. We will design DP algorithms to output models that approximately minimize the test loss while having access only to the training loss.

In the literature, there are a variety of approaches towards designing these DP learning algorithms [CMS11, KST12, BST14, PAE16, BTT18]. One can categorize them broadly as: i) algorithms that assume that the individual loss function $l(\theta;\cdot)$ is convex in the model parameter to ensure differential privacy, ii) algorithms that are differentially private even when the loss function is non-convex in nature (e.g., deep learning models), and iii) model agnostic algorithms, that do not require any information about the representation of the model $\theta$, or the loss function $l(\theta;\cdot)$. In our current discussion, we will only focus on designing algorithms for (ii), and (iii). This is because it turns out that the best known algorithms for (ii) are already competitive to algorithms that are specific for (i) [INS19].

## Private Algorithms for Training Deep Learning Models

The first approach, due to SCS13, BST14, and ACG16, is named differentially private stochastic gradient descent (DP-SGD). It proposes to modify the model updates computed by the most common optimizer used in deep learning: stochastic gradient descent (SGD). Typically, stochastic gradient descent trains iteratively. At each iteration, a small number of training examples (a “minibatch”) are sampled from the training set. The optimizer computes the average model error on these examples, and then differentiates this average error with respect to each of the model parameters to obtain a gradient vector. Finally, the model parameters ($\theta_t$) are updated by subtracting this gradient ($\nabla_t$) multiplied by a small constant $\eta$ (the learning rate controls how quickly the optimizer updates the model’s parameters). At a high level, two modifications are made by DP-SGD to obtain differential privacy: gradients, which are computed on a per-example basis (rather than averaged over multiple examples), are first clipped to control their sensitivity, and, second, spherical Gaussian noise $b_t$ is added to their sum to obtain the indistinguishability needed for DP. Succinctly, the update step can be written as follows: $\theta_{t+1} \leftarrow \theta_t - \eta \cdot (\nabla_t + b_t)$.

Let us take the example of a hospital training a model to predict whether patients will be readmitted after being released. To train the model, the hospital uses information from patient records, such as demographic variables and admission variables (e.g., age, ethnicity, insurance type, type of Intensive Care Unit admitted to) but also time-varying vitals and labs (e.g., heart rate, blood pressure, white blood cell counts) [JPS16]. The modifications made by DP-SGD ensure that if (1) Jane Smith’s individual patient record contained unusual features, e.g., her insurance provider was uncommon for people of her age or her heart rate followed an unusual pattern, the resulting signal will have a bounded impact on our model updates, and (2) the model’s final parameters would be essentially identical should Jane Smith have chosen to not contribute (i.e., opt-out) her patient record to the training set. Stronger differential privacy is achieved when one is able to introduce more noise (i.e., sample noise with larger standard deviation) and train for as few iterations as possible.

Two main components in the above DP-SGD algorithm that distinguishes itself from traditional SGD are: i) per-example clipping and ii) Gaussian noise addition. In addition, for the analysis to hold, DP-SGD requires that sub-sampling of mini batches is uniform at random from the training data set. While this is not a requirement of DP-SGD per se, in practice many implementations of SGD do not satisfy this requirement and instead analyze different permutations of the data at each epoch of training.

While gradient clipping is common in deep learning, often used as a form of regularization, it differs from that in DP-SGD as follows: The average gradient over the minibatch is clipped, as opposed to clipping the gradient of individual examples (i.e., $l(\theta_t;(x,y))$ before averaging. It is an ongoing research direction to both understand the effect of per-example clipping in DP-SGD in model training [SSTT21], and also effective ways to mitigate its impact both in terms of accuracy [PTS21], and training time [ZHS19].

In standard stochastic gradient descent, subsampling is usually used either as a way to speed up the training process [CAR16], or as a a form of regularization [RCR15]. In DP-SGD, the randomness in the subsampling of the minibatch is used to guarantee DP. The technical component for this sort of privacy analysis is called privacy amplification by subsampling [KLNRS08,BBG18]. Since the sampling randomness is used to guarantee DP, it is crucial that the uniformity in the sampling step is of cryptographic strength. Another, (possibly) counterintuitive feature of DP-SGD is that for best privacy/utility trade-off it is in general better to have larger batch sizes. In fact, full-batch DP-gradient descent may provide the best privacy/utility trade-offs, albeit at the expense of computational feasibility.

For a fixed DP guarantee, the magnitude of the Gaussian noise that gets added to the gradient updates in each step in DP-SGD is proportional to $\sqrt{the\ number\ of\ steps}$ the model is trained for. As a result, it is important to tune the number of training steps for best privacy/utility trade-offs.

In the following tutorial, we provide a small code snippet to train a model with DP-SGD.

## Model Agnostic Private Learning

The Sample and Aggregate framework [NRS07] is a generic method to add differential privacy to a non-private algorithm without caring about the internal workings of it, a.k.a. model agnostic. In the context of machine learning, one can state the main idea as follows: Consider a multi-class classification problem. Take the training data, and split into k disjoint subsets of equal size. Train independent models $\theta_1, \theta_2, …, \theta_k$ on the disjoint subsets. In order to predict on an test example x, first, compute a private histogram over the set of k predictions $\theta_1(x), \theta_2(x), …, \theta_k(x)$. Then, select and output the bin in the histogram based on the highest count, after adding a small amount of Laplace/Gaussian noise to the counts. In the context of DP learning, this particular approach was used in two different lines of work: i) PATE [PAE16], and ii) Model agnostic private learning [BTT18]. While the latter focussed on obtaining theoretical privacy/utility trade-offs for a class of learning tasks (e.g., agnostic PAC learning), the PATE approach focuses on practical deployment. Both these lines of work make one common observation. If the predictions from $\theta_1(x), \theta_2(x), …, \theta_k(x)$ are fairly consistent, then the privacy cost in terms of DP is very small. Hence, one can run a large number of prediction queries, without violating DP constraints. In the following, we describe the PATE approach in detail.

The private aggregation of teacher ensembles (PATE) demonstrated in particular that this approach allows one to learn deep neural networks with differential privacy. It proposes to have an ensemble of models trained without privacy predict with differential privacy by having these models predict in aggregate rather than revealing their individual predictions. In PATE, we start by partitioning the private dataset into smaller subsets of data. These subsets are partitions, so there is no overlap between the data included in any pair of partitions. If Jane Smith’s record was in our private dataset, then it is included in one of the partitions only. That is, only one of the teachers has analyzed Jane Smith’s record during training. We train a ML model, called a teacher, on each of these partitions. We now have an ensemble of teacher models that were trained independently, but without any guarantees of privacy. How do we use this ensemble to make predictions that respect privacy? In PATE, we add noise while aggregating the predictions made individually by each teacher to form a single common prediction. We count the number of teachers who voted for each class, and then perturb that count by adding random noise sampled from the Laplace or Gaussian distribution. Each label predicted by the noisy aggregation mechanism comes with rigorous differential privacy guarantees that bound the privacy budget spent to label that input. Again, stronger differential privacy is achieved when we are able to introduce more noise in the aggregation and are able to answer as few queries as possible. Let us now come back to our running example. Imagine that we’d like to use the output of PATE to know if Jane likes a particular movie. The only teacher trained on the partition containing Jane Smith’s data—has now learned that a record similar to Jane’s is characteristic of an individual who likes similar movies, and as a consequence changes its prediction on a test input which is similar to Jane’s to predict the movie rating assigned by Jane. However, because the teacher only contributes a single vote to the aggregation, and that the aggregation injects noise, we won’t be able to know whether the teacher changed its prediction to the movie rating assigned by Jane because the teacher indeed trained on Jane’s data or because the noise injected during the aggregation “flipped” that teacher’s vote. The random noise added to vote counts prevents the outcome of aggregation from reflecting the votes of any individual teachers to protect privacy.

## Practically deploying differential privacy in machine learning

The two approaches we introduced have the advantage of being conceptually simple to understand. Fortunately, there also exist several open-source implementations of these approaches. For instance, DP-SGD is implemented in TensorFlow Privacy, Objax, and Opacus. This means that one is able to take an existing TensorFlow, JAX, or PyTorch pipeline for training a machine learning model and replace a non-private optimizer with DP-SGD. An example implementation of PATE is also available in TensorFlow Privacy. So what are the concrete potential obstacles to deploying machine learning with differential privacy?

The first obstacle is the accuracy of privacy-preserving models. Datasets are often sampled from distribution with heavy tails. For instance, in a medical application, there are typically (and fortunately) fewer patients with a given medical condition than patients without that condition. This means that there are fewer training examples for patients with each medical condition to learn from. Because differential privacy prevents us from learning patterns which are not found generally across the training data, it limits our ability to learn from these patients for which we have very few examples of [SPG]. More generally, there is often a trade-off between the accuracy of a model and the strength of the differential privacy guarantee it was trained with: the smaller the privacy budget is, the larger the impact on accuracy typically is. That said, this tension is not always inevitable and there are instances where privacy and accuracy are synergical because differential privacy implies generalization [DFH15] (but not vice versa).

The second obstacle to deploying differentially private machine learning can be the computational overhead. For instance, in DP-SGD one must compute per-example gradients rather than average gradients. This often means that optimizations implemented in machine learning frameworks to exploit matrix algebra supported by underlying hardware accelerators (e.g., GPUs) are harder to take advantage of. In another example, PATE requires that one train multiple models (the teachers) rather than a single model so this can also introduce overhead in the training procedure. Fortunately, this cost is mostly mitigated in recent implementations of private learning algorithms, in particular in Objax and Opacus.

The third obstacle to deploying differential privacy, in machine learning but more generally in any form of data analysis, is the choice of privacy budget. The smaller the budget, the stronger the guarantee is. This means one can compare two analyses and say which one is “more private”. However, this also means that it is unclear what is “small enough” of a privacy budget. This is particularly problematic given that applications of differential privacy to machine learning often require a privacy budget that provides little theoretical guarantees in order to train a model whose accuracy is large enough to warrant a useful deployment. Thus, it may be interesting for practitioners to evaluate the privacy of their machine learning algorithm by attacking it themselves. Whereas the theoretical analysis of an algorithm’s differential privacy guarantees provides a worst-case guarantee limiting how much private information the algorithm can leak against any adversary, implementing a specific attack can be useful to know how successful a particular adversary or class of adversaries would be. This helps interpret the theoretical guarantee but may not be treated as a direct substitute for it. Open-source implementations of such attacks are increasingly available: e.g., for membership inference here and here.

## Conclusion

In the above, we discussed some of the algorithmic approaches towards differentially private model training which have been effective both in theoretical and practical settings. Since it is a rapidly growing field, we could not cover all the important aspects of the research space. Some prominent ones include: i) Choice of the best hyperparameters in the training of DP models.In order to ensure that the overall algorithm preserves differential privacy, one needs to ensure that the choice of hyperparameters itself preserves DP. Recent research has provided algorithms for selecting the best hyperparameters in a differentially private fashion [LT19]. ii) Choice of network architecture: it is not always true that the best known model architectures for non-private model training are indeed the best for training with differential privacy. In particular, we know that the number of model parameters may have adverse effects on the privacy/utility trade-offs [BST14]. Hence, choosing the right model architecture is important for providing a good privacy/utility trade-off [PTS21]. (iii) Training in the federated/distributed setting: in the above exposition, we assumed that the training data lies in a single centralized location. However, in settings like Federated Learning (FL) [MMRHA17], the data records can be highly distributed, e.g., across various mobile devices. Running DP-SGD in the FL setting, which is required for FL to provide privacy guarantees for the training data, raises a series of challenges [KMA19] which are often facilitated by distributed private learning algorithms designed specifically for FL settings [BKMTT20, KMSTTZ21]. Some of the specific challenges in the context of FL include, limited and non-uniform availability of clients (holding individual data records) and unknown (and variable) size of the training data [BKMTT18]. On the other hand, PATE style algorithms lend themselves naturally to the distributed setting once combined with existing cryptographic primitives, as demonstrated by the CaPC protocol [CDD21]. It is an active area of research to address these above challenges.

## Acknowledgements

The authors would like to thank Thomas Steinke and Andreas Terzis for detailed feedback and edit suggestions. Parts of this blog post previously appeared on www.cleverhans.io.

## Citations

[ACG16] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016, October). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308-318). ACM.

[BBG18] Balle, B., Barthe, G., & Gaboardi, M. (2018). Privacy amplification by subsampling: Tight analyses via couplings and divergences. arXiv preprint arXiv:1807.01647.

[BKMTT18] Balle, B., Kairouz P., McMahan M., Thakkar O. & Thakurta A. (2020). Privacy amplification via random check-ins. In NeurIPS.

[MMRHA17] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273-1282). PMLR.

[KMSTTZ18] Kairouz P., McMahan M., Song S., Thakkar O., Thakurta A., & Xu Z. (2021). Practical and Private (Deep) Learning without Sampling or Shuffling. In ICML.

[BFTT19] Bassily, R., Feldman, V., Talwar, K., & Thakurta, A. Private Stochastic Convex Optimization with Optimal Rates. In NeurIPS 2019.

[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science.

[BTT18] Bassily, R., Thakurta, A. G., & Thakkar, O. D. (2018). Model-agnostic private learning. Advances in Neural Information Processing Systems.

[CDD21] Choquette-Choo, C. A., Dullerud, N., Dziedzic, A., Zhang, Y., Jha, S., Papernot, N., & Wang, X. (2021). CaPC Learning: Confidential and Private Collaborative Learning. arXiv preprint arXiv:2102.05188.

[CMS11] Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3).

[CTW20] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … & Raffel, C. (2020). Extracting training data from large language models. arXiv preprint arXiv:2012.07805.

[DFH15] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). Generalization in adaptive data analysis and holdout reuse. arXiv preprint arXiv:1506.02629.

[DMNS06] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006, March). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (pp. 265-284). Springer, Berlin, Heidelberg.

[DNPR10] Dwork, C., Naor, M., Pitassi, T., & Rothblum, G. N. (2010, June). Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing (pp. 715-724).

[DTTZ14] Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014, May). Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (pp. 11-20).

[HSL20] Huang, Y., Song, Z., Li, K., & Arora, S. (2020, November). Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning (pp. 4507-4518). PMLR.

[HR12] Hardt, M., & Roth, A. (2012, May). Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (pp. 1255-1268).

[HR13] Hardt, M., & Roth, A. (2013, June). Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (pp. 331-340).

[JPS16] Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35

[JTT18] Jain, P., Thakkar, O. D., & Thakurta, A. (2018, July). Differentially private matrix completion revisited. In International Conference on Machine Learning (pp. 2215-2224). PMLR.

[INS19] Iyengar, R., Near, J. P., Song, D., Thakkar, O., Thakurta, A., & Wang, L. (2019, May). Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP) (pp. 299-316). IEEE.

[KST12] Kifer, D., Smith, A., & Thakurta, A. (2012, June). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (pp. 25-1). JMLR Workshop and Conference Proceedings.

[KMA19] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., … & Zhao, S. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

[KV05] Kalai, Adam, and Santosh Vempala. “Efficient algorithms for online decision problems.” Journal of Computer and System Sciences 71.3 (2005): 291-307.

[KLNRS08] Raskhodnikova, S., Smith, A., Lee, H. K., Nissim, K., & Kasiviswanathan, S. P. (2008). What can we learn privately. In Proceedings of the 54th Annual Symposium on Foundations of Computer Science (pp. 531-540).

[LLV07] Li, N., Li, T., & Venkatasubramanian, S. (2007, April). t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering (pp. 106-115). IEEE.

[LT19] Liu, J., & Talwar, K. (2019, June). Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (pp. 298-309).

[M17] Mironov, I. (2017, August). Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th (pp. 263-275). IEEE.

[MKG07] Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke, Johannes; Venkitasubramaniam, Muthuramakrishnan (March 2007). “L-diversity: Privacy Beyond K-anonymity”. ACM Transactions on Knowledge Discovery from Data.

[NRS07] Nissim, K., Raskhodnikova, S., & Smith, A. (2007, June). Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (pp. 75-84).

[NS08] Narayanan, A., & Shmatikov, V. (2008, May). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on (pp. 111-125). IEEE.

[PAE16] Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., & Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. ICLR 2017.

[PTS21] Papernot, N., Thakurta, A., Song, S., Chien, S., & Erlingsson, U. (2020). Tempered sigmoid activations for deep learning with differential privacy. AAAI 2021.

[RCR15] Rudi, A., Camoriano, R., & Rosasco, L. (2015, December). Less is More: Nyström Computational Regularization. In NIPS (pp. 1657-1665).

[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP ’13, pages 245–248, Washington, DC, USA, 2013. IEEE Computer Society.

[SPG] Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings. Vinith Suriyakumar, Nicolas Papernot, Anna Goldenberg, Marzyeh Ghassemi. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.

[SS98] Samarati, Pierangela; Sweeney, Latanya (1998). “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression” (PDF). Harvard Data Privacy Lab. Retrieved April 12, 2017

[SSS17] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017, May). Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on (pp. 3-18). IEEE.

[SSTT21] Song, S., Thakkar, O., & Thakurta, A. (2020). Evading the Curse of Dimensionality in Unconstrained Private GLMs. In AISTATS 2021.

[XT07] Xiao X, Tao Y (2007) M-invariance: towards privacy preserving re-publication of dynamic datasets. In: SIGMOD conference, Beijing, China, pp 689–700

[YEO21] Yala, A., Esfahanizadeh, H., Oliveira, R. G. D., Duffy, K. R., Ghobadi, M., Jaakkola, T. S., … & Medard, M. (2021). NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training. arXiv preprint arXiv:2106.02484.

[ZHS19] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019.

by Abhradeep Thakurta at October 25, 2021 06:00 PM UTC

### How to send a real number using a single bit (and some shared randomness)

In this post, we'll look at the natural problem of how to communicate an estimate of a real value in [0,1], using just 1 bit.  The post is based on this paper (by Ran Ben-Basat of UCL and Shay Vargaftik of VMware Research and myself -- they helped also with the post) that appeared in ICALP last year.

This question is motivated by various aggregation problems;  multiple sending devices may measure a value, and wish to send the value to an aggregator who will compute something from the received values, such as the average.  In our problem, the senders have a real value x in [0,1] to send, but are constrained to send just a single bit.  Variations of this problem have come up in recent work on distributed/federated learning, where clients compute a gradient vector and send it to a centralized parameter server to update a learning model;  we may want to compress the vector to a small number of bits, or even 1 bit, per coordinate.  (We'll have more to say on the federated learning problem in a future post.)  Of course, it's also just an interesting randomized algorithm problem that seems interesting in its own right.

A natural way to look at the problem is as a variation on rounding.  Given a value x in [0,1], one natural approach if limited to one bit is to deterministically round it to X.  But what should the receiver do when they receive the (rounded) bit value X?  It depends on what one's optimization goal is, but to minimize the maximum possible error, the receiver should have their estimate x' take on the value 3/4 when X is 1, 1/4 otherwise.  Note though that deterministic rounding this way is biased -- the expectation E[x'] does not equal x.  Randomized rounding, where the sender sends 1 with probability x and 0 otherwise, and the receiver uses the received bit X as the estimate x', has the property that E[x'] = x.  Unbiased estimators are arguably more natural for many estimation problems.  Here the measure of performance would be the maximum variance for the estimate over all inputs x, so for randomized rounding the cost is 1/4 (when x = 1/2).

Can one do better than these schemes?  It turns out that you can, if you have available shared randomness.  An approach that has been known in the engineering world (where it has been used in signal processing) is subtractive dithering:

We assume that the sender and receiver have access to shared randomness ℎ∼𝑈[−1/2,1/2].  Given a value x, the sender sends 1 if x+h≥1/2, 0 otherwise.  The receiver estimates x' = X - h.  We leave as an exercise that this is unbiased, which can be shown by deriving the stronger fact that x' is distributed as 𝑈[𝑥−1/2,𝑥+1/2] , and thus Var[𝑥']=1/12.

Subtractive dithering ignores that generating a shared real number may be more costly or problematic than generating a finite number of shared bits.  So one of the results of our paper is developing a "finite shared random bits" unbiased estimator, that corresponds to randomized rounding with no shared bits and converges to subtractive dithering as the number of shared random bits goes to infinity.  (The approach does allow for generating a private random real value.)

Also in our paper, we study biased schemes, aiming to minimize the worst-case expected mean-squared error (where the expectation is over randomness used in the algorithm).  For example, it's very odd in the setting of subtractive dithering that one can obtain estimates smaller than 0 or greater than 1, when the input is restricted to [0,1], but that's a price we pay for having an unbiased estimator.  For  a biased estimator, you might naturally truncate the result from subtractive dithering;  by truncating to [z,1-z] for an appropriate z > 0, you can indeed slightly improve over the worst-case mean-squared error of 1/16 for deterministic rounding.

We then studied various algorithmic improvements to obtain better biased schemes.  We were able to progress by looking at limited shared randomness, namely finding the best algorithm with s shared bits.  For example, consider the case of just 1 shared random bit h in {0,1}.  The receiver receives 1 bit X from the sender, and thus can have four possible estimates x' depending on X and h.  If the 4 possible estimate values are v0, v1, v2, v3 (all between 0 and 1), then it is possible to show the largest possible expected squared error occurs at one of the five inputs 0, 1, (v0+v1)/2, (v1+v2)/2, (v2+v3)/2.   We can then write a quadratic program to find the values that minimizes the worst-case expected squared error.  The end result is the following rounding algorithm:  given 1 shared random bit h in {0,1} and the value x, let X = 1 if x ≥ 0.4 + 0.2h, and 0 otherwise;  then let the estimate x' = 0.1 + 0.2h + 0.6X.  This has a worst-case expected mean-squared error of 1/20, beating deterministic rounding and truncated subtractive dithering.  Using some additional arguments we can handle more shared random bits;  at 8 bits we improve the worst-case expected squared error to about 0.04599, which is quite close to our lower bound of about 0.0459, and this is better than anything we could come up with analytically.  The optimal solution is still not known (an open question for future work!).

Overall there are many variants of the rounding problem, and few tight bounds currently.  So even for simple-seeming problems like rounding, there are still interesting things to do.

by Michael Mitzenmacher (noreply@blogger.com) at October 25, 2021 03:00 PM UTC

### Three-year and tenure-track positions at TTIC (apply by December 1, 2021)

from CCI: jobs

TTIC invites applications for the following faculty positions: research assistant professor (3-year term), tenure-track assistant professor, full or associate professor, and visiting professor. Applicants for research assistant professor positions (RAPs) are encouraged to simultaneously apply for the TTIC RAP program and the Simons-Berkeley Research Fellowship.

Website: https://www.ttic.edu/faculty-hiring/
Email: recruiting@ttic.edu

by shacharlovett at October 25, 2021 02:46 PM UTC

### Fibonacci and I

from Emanuele Viola

The other day I couldn’t remember Fibonacci’s original motivation/presentation of the sequence now famously named after him. This had to be corrected immediately, because of the picture above and my first publication (1994) which includes a simple algorithm to decompress sounds. The compression algorithm works by storing rather than the sound data — think of it as the key — the difference between consecutive keys. The saving comes from not allowing every possible difference, but only those in… the Fibonacci sequence. Why those differences are the right ones is part of the mystique which makes studying the sequence fun. For further technical but not mystical details see the paper; an implementation of the decompressor is given in the Motorola 68000 assembly code.

This is me on my way to Fibonacci from Rome, some years ago:

I actually find some presentations of the sequence a little hard to grasp, so I came up with a trivially different rendering which now will make it impossible for me to forget:

There are two types of trees: Young and old. You start with one young tree. In one period, a young tree produces another young tree and becomes old, and an old tree produces a young tree and dies. How many young trees are there after t periods?

I also couldn’t exactly remember the spiral you can make with these numbers. But you can tile the plane with squares whose sides come from the sequence, if you arrange them in a spiral.

### Assistant/Associate Professor – Theoretical Computer Science at University of Massachusetts Amherst (apply by December 1, 2021)

from CCI: jobs

The College of Information and Computer Sciences (CICS) at the University of Massachusetts Amherst invites applications for tenure-track faculty in Theoretical Computer Science at the Associate and Assistant Professor levels. Exceptional candidates at other ranks may be considered.

Website: https://careers.umass.edu/amherst/en-us/job/510665/assistantassociate-professor-theoretical-computer-science
Email: cmusco@cs.umass.edu

by shacharlovett at October 25, 2021 01:56 PM UTC

### Squaring the circle is mentioned in a Gilbert and Sullivan comic Opera.

The problem of squaring the circle: Given a circle, construct (with ruler and compass) a square with the same area. While browsing the web for more information on this problem (for the blog entry on problems that might be similar to P vs NP: here)  I came across the following:

In the Gilbert and Sullivan comic opera Princess Ida, in the song Gently, Gently  is the line:

... and the circle they will square it one fine day.

(To hear the song see here. The line is towards the end.)

They lyrics are here. That website begins  gsarchive.net which made me wonder Did I at one time set up a website of math refs in Gilbert and Sullivan plays (gsarch is very close to gasarch) ? which IS the kind of thing I would do. The answer is no:  gsarch stands for Gilbert and Sullivan archive. They could have called it gasarch if they used the and in Gilbert and Sullivan but abbreviated archive as arch. Then I would have been far more confused.

Moving on...

In 1884 Princess Ida opened in 1884. For more on this comic opera see here.

In 1882 pi was proven  transcendental and hence one cannot square the circle. For more on pi being transcendental see here.

Kolmogorov Random Thoughts on all of this

0) The song is sung my three men who are making fun of the notion of a women's college. The song is about all the things the women are trying to do that are absurd such as squaring the circle. They also mention perpetual motion machines.

1) Did G and S know that the squaring the circle had been proven impossible, or just that it was thought to be impossible, or just that it was thought to be hard?

2) Was it known that perpetual motion machines were impossible? Or just very hard?

3) G and S used Mathematics in at least one other song:  I am the very model of a modern major general, from The Pirates of Penzance  has the lines:

I'm very well acquainted too with matters mathematical

I understand equations, both the simple and quadratical,

About binomial theorems I'm teeming with the a lot o' news---

With many cheerful facts about the square of the hypotenuse

and later

I'm very good at integral and differential calculus

See here for all the lyrics. The website mentioned in the next point has a pointer to a YouTube video of people singing it.

4) There are many parodies of Modern Major General. The earliest ones I know of is Tom Lehrer's  The Elements. Since making a website of them IS the kind of thing I would do,  while writing this post I did it (Are we compelled to do things that fit our image of ourselves? Yup.) The website is here. It has 36 parodies (as of Oct 17, 2021 when I wrote this blog--- it may have more if you read this later.) That may seem like a lot, but it pales in comparison  to the most satirized song of all time: The 12 days of Christmas which I did an ugly lyrics-only website for back before html had nice tools, see here. It has 143 songs on it but I am sure there are many more. (Note to self: redo that website when you have time. Maybe when I retire.)

4) I suspect that G and S knew more math, or perhaps knew of more math,  than Broadway composers know now. I suspect this is a more general trend: people are more specialized now. Having said that, I need to mention the off-Broadway musical Fermat's last Tango which I liked more than Lance (see his post on it here).

5) How much math would you need to know in order to insert some into your play or movie? With Wikipedia and other web sources you could find out some things, but you would have to have some idea what you are looking for. And perhaps you would need some math background in order to even want to insert some math into your work in the first place.

6)  Here's hoping someone will make a musical about William Rowan Hamilton using this song here as a starting point. I blogged rather optimistically about that possibility here.

by gasarch (noreply@blogger.com) at October 24, 2021 06:51 PM UTC

### New computational geometry journal

from David Eppstein

Over the past year or so I’ve been working with Marc van Kreveld and Wolfgang Mulzer to set up a new diamond open access computational geometry journal, Computing in Geometry and Topology, sponsored by the Society for Computational Geometry, the organization set up to run the annual Symposium on Computational Geometry. It is the second such journal to be established, after the Journal of Computational Geometry. This week the new journal went live and it is now available for submissions. Below is the official announcement we sent out to several mailing lists:

Dear all,

As of this week, the new diamond open access journal

Computing in Geometry and Topology

is welcoming submissions. The website of the journal is https://www.cgt-journal.org/index.php/cgt

Here you can also find the editorial board, the submission guidelines, the scope, and a style file.

If you have any questions about the journal, please send them to info@cgt-journal.org

Best regards, David Eppstein, Marc van Kreveld, and Wolfgang Mulzer

Purpose and scope

Computing in Geometry and Topology aims to support the broader computational geometry and topology community by being a peer-reviewed scientific journal that provides diamond open access. Computing in Geometry and Topology is sponsored by the Society for Computational Geometry.

With the broader computational geometry and topology community, we include researchers in discrete and combinatorial geometry, and any application area of computational geometry and topology. We also include algorithm engineering for geometric computations.

The journal publishes two types of papers. Firstly, the journal publishes original research of sufficient depth and interest. Secondly, the journal publishes high-quality survey papers. Every paper has been thoroughly reviewed by experts in the area.

To emphasize the breadth of the interpretation of computational geometry and topology, the editorial board has different sections that represent the algorithmic and mathematical aspects, the applied aspects, and the engineering aspects.

https://www.cgt-journal.org/index.php/cgt

by David Eppstein at October 24, 2021 05:42 PM UTC

### TR21-147 | Extractors for Sum of Two Sources | Jyun-Jie Liao, Eshan Chattopadhyay

from ECCC papers

We consider the problem of extracting randomness from \textit{sumset sources}, a general class of weak sources introduced by Chattopadhyay and Li (STOC, 2016). An $(n,k,C)$-sumset source $\mathbf{X}$ is a distribution on $\{0,1\}^n$ of the form $\mathbf{X}_1 + \mathbf{X}_2 + \ldots + \mathbf{X}_C$, where $\mathbf{X}_i$'s are independent sources on $n$ bits with min-entropy at least $k$. Prior extractors either required the number of sources $C$ to be a large constant or the min-entropy $k$ to be at least $0.51 n$. As our main result, we construct an explicit extractor for sumset sources in the setting of $C=2$ for min-entropy $\mathrm{poly}(\log n)$ and polynomially small error. We can further improve the min-entropy requirement to $(\log n) \cdot (\log \log n)^{1 + o(1)}$ at the expense of worse error parameter of our extractor. We find applications of our sumset extractor for extracting randomness from other well-studied models of weak sources such as affine sources, small-space sources, and interleaved sources.

### Tenure-Track Faculty Positions at Simon Fraser University (apply by December 1, 2021)

from CCI: jobs

The School of Computing Science at Simon Fraser University (SFU) invites applications for tenure-track faculty positions. The School has multiple openings. Excellent applicants in all areas of computer science will be considered.

Website: https://www.sfu.ca/computing/job-opportunities.html
Email: cs_faculty_affairs@sfu.ca

by shacharlovett at October 23, 2021 07:44 PM UTC

### TCS+ talk: Wednesday, October 27 — Shravas Rao, Northwestern University

The next TCS+ talk will take place this coming Wednesday, October 27th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC). Shravas Rao from Northwestern University will speak about “Degree vs. Approximate Degree and Quantum Implications of Huang’s Sensitivity Theorem” (abstract below).

You can reserve a spot as an individual or a group to join us live by signing up on the online form. Registration is not required to attend the interactive talk, and the link will be posted on the website the day prior to the talk; however, by registering in the form, you will receive a reminder, along with the link. (The recorded talk will also be posted on our website afterwards) As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: Based on the recent breakthrough of Huang (2019), we show that for any total Boolean function $f$,

• The degree of $f$ is at most quadratic in the approximate degree of $f$. This is optimal as witnessed by the OR function.
• The deterministic query complexity of f is at most quartic in the quantum query complexity of $f$. This matches the known separation (up to log factors) due to Ambainis, Balodis, Belovs, Lee, Santha, and Smotrovs (2017).

We apply these results to resolve the quantum analogue of the Aanderaa–Karp–Rosenberg conjecture. We show that if $f$ is a nontrivial monotone graph property of an $n$-vertex graph specified by its adjacency matrix, then $Q(f)=\Omega(n)$, which is also optimal. We also show that the approximate degree of any read-once formula on $n$ variables is $\Theta(\sqrt{n})$.

Based on joint work with Scott Aaronson, Shalev Ben-David, Robin Kothari, and Avishay Tal.

by plustcs at October 23, 2021 07:14 AM UTC

### An Annoying Problem

from Richard Lipton

It’s the stupid questions that have some of the most surprising and interesting answers — Cory Doctorow

 ACM Turing Award source

Robert Tarjan is well known to most—a Turing award winner in 1986 with John Hopcroft. Bob is a professor of Computer Science at Princeton University, and one of the world experts on linear time graph algorithms. I always thought if some NP-complete graph problem had a linear time algorithm, then P=NP would have long been solved by Bob.

Today we talk about a problem that hasn’t been solved by Bob.

The problem I mean was quasi-solved by Bob. That is, he gave a quasi-polynomial time algorithm. The problem is group isomorphism (GpI). Laci Babai discusses its relation as a special case of graph isomorphism (GI) in section 13.2 of his famous 2016 paper giving a quasi-polynomial time algorithm for GI. He says that the annoyance of GpI should temper expectations for getting GI into polynomial time, because:

[I]n spite of considerable effort and the availability of powerful algebraic machinery, Group Isomorphism is still not known to be in ${\mathsf{P}}$.

## Bob’s Least Result

Bob has many great and deep results. For example, Donald Knuth described Bob’s strong connected component algorithm for directed graphs in these words:

The data structures that he devised for this problem fit together in an amazingly beautiful way, so that the quantities you need to look at while exploring a directed graph are always magically at your fingertips.

Here is Ken’s slight editing of Wikipedia’s presentation:

 Input: Graph G = (V,E) whose nodes have fields int index, int lowlink, bool onStack Output: The set of subsets of V denoting the strongly connected components

Stack S := empty stack;
int index := 0;  //global variable with invariant: index = smallest unused index

proc strongconnect(v):
v.index := index; v.lowlink := index; index++;
S.push(v); v.onStack := true;
for each w such that (v,w) is in E:

if w.index is undefined then:  //w not yet visited so recurse on it

strongconnect(w);

else if w.onStack then:

//clever: w.index was lowest unused index at the time

else:
pass;  //(v,w) goes to already-found component, so ignore.
end if;
end for

if v.lowlink == v.index then: //we've completed a component so pop and output it
repeat:
Node w = S.pop();
w.onStack = false;
output w as part of the current strongly connected component;
until w == v;
end if;
end proc;

for each v in V do:
if v.index is undefined then:
strongconnect(v);
end if;
end for;

## The Annoying Problem

Just over ten years ago we talked about: How do we tell if two groups are isomorphic? Precisely on October 8, 2011 we talked about group isomorphism. We called it an annoying problem. It remains annoying.

In the group isomorphism problem (GpI), we are given two finite groups ${G}$ and ${H}$ of order ${n}$ in terms of their multiplication tables and must decide if ${G \cong H}$.

Bob Tarjan famously noted that this could be done ${n^{\log_2 n + O(1)}}$ time. He never published this result and just orally told others about it. The insight was that every group of order ${n}$ had a generator set of size at most ${\log_2 n}$. Since his insight the group isomorphism problem remains roughly the same complexity. The best is the improvement due to David Rosenbaum. His algorithm is still of order ${n^{c \log n}}$. See here for details.

Theorem 1 General group isomorphism is in ${n^{(1/2) log_p n+O(1)}}$ deterministic time where ${p}$ is the smallest prime dividing the order of the group.

Better than Bob’s insight, but still not polynomial time.

## Order Restriction

The results to date on GpI have relied mainly on basic group theory. Bob’s result uses an elementary fact about finite groups. David uses the same insight and adds a deep fact about isomorphism type problems. Neither use any of the “millions” of theorems known about finite groups.

I wondered if there is some hope to start to exploit more of the papers on group structure? The trouble I think is that these results are quite difficult for those not in group theory to understand. But perhaps there is some hope. What do you think?

I still wonder if there is a polynomial time algorithm for the general group isomorphism problem? Can we eliminate the ${\log n}$ exponent somehow? This lead us to note the following simple theorem:

Theorem 2 Let ${G}$ and ${H}$ be finite groups of order ${n}$ that is square free.Then ${G \cong H}$ can be determined in time polynomial in ${n}$ time.

That is, ${n}$ has no prime ${p}$ so that ${p^2}$ divides ${n}$.

## But Wait…

As I worked on this I found a nice paper by Heiko Dietrich and James Wilson, titled “Polynomial Time Isomorphism Tests Of Black-Box Type Groups Of Most Orders.” They reference another paper that proved a stronger theorem than our above one:

Theorem 3 There is an algorithm that given groups ${G}$ and ${H}$ of permutations on finitely many points, decides whether they are of cube-free order, and if so, decides that ${G \cong H}$ or constructs an isomorphism ${G \rightarrow H}$. The algorithm runs in time polynomial in the input size.

## Proof of Our Theorem

Definition 4 A finite group ${G}$ is metacyclic provided it has a normal subgroup ${N}$ so that ${N}$ and ${G/N}$ are both cyclic.

Every finite group of square-free order (i.e. the order is not divisible by the square of a natural number) is metacyclic.

Proof: We claim that ${G}$ and ${H}$ are both metacyclic. This follows from the fact that their order is square free. But then we know that both ${G}$ and ${H}$ are generated by at most two elements—see this. This yields an isomorphism algorithm based on the above generator trick. $\Box$

Their stronger proof uses quite a bit more group theory. But our simple proof may give you some intuition why the restriction on the order of the group helps.

## Open Problems

The above result fits well with the belief that the worst case for isomorphism is when the order of the groups is a power of a prime. Some possible conjectures are:

• Can we generalize “${n}$ is square free” to “${p^4}$ does not divide ${n}$“? Or to “${p^{5}}$ does not divide ${n}$?” And so on?

• Let ${Z(G)=1}$ and ${|Aut(G)| \le |G|^{O(1)}}$. Then GpI is in polynomial time.

Definition 5 A finite group ${G}$ is metabelian provided it has a normal subgroup ${N}$ so that ${N}$ and ${G/N}$ are both abelian.

• What if we replace metacyclic by metabelian?

by rjlipton at October 22, 2021 08:53 PM UTC

### An Orthodox rabbi and Steven Weinberg walk into an email exchange…

from Scott Aaronson

Ever since I posted my obituary for the great Steven Weinberg three months ago, I’ve gotten a steady trickle of emails—all of which I’ve appreciated enormously—from people who knew Steve, or were influenced by him, and who wanted to share their own thoughts and memories. Last week, I was contacted by one Moshe Katz, an Orthodox rabbi, who wanted to share a long email exchange that he’d had with Steve, about Steve’s reasons for rejecting his birth-religion of Judaism (along with every other religion). Even though Rabbi Katz, rather than Steve, does most of the talking in this exchange, and even though Steve mostly expresses the same views he’d expressed in many of his public writings, I knew immediately on seeing this exchange that it could be of broader interest—so I secured permission to share it here on Shtetl-Optimized, both from Rabbi Katz and from Steve’s widow Louise.

While longtime readers can probably guess what I think about most of the topics discussed, I’ll refrain from any editorial commentary in this post—but of course, feel free to share your own thoughts in the comments, and maybe I’ll join in. Mostly, reading this exchange reminded me that someone at some point should write a proper book-length biography of Steve, and someone should also curate and publish a selection of his correspondence, much like Perfectly Reasonable Deviations from the Beaten Track did for Richard Feynman. There must be a lot more gems to be mined.

Anyway, without further ado, here’s the exchange (10 pages, PDF).

by Scott at October 22, 2021 03:28 PM UTC

### Welcome to scottaaronson.blog !

from Scott Aaronson

If you’ve visited Shtetl-Optimized lately — which, uh, I suppose you have — you may have noticed that your URL was redirected from www.scottaaronson.com/blog to scottaaronson.blog. That’s because Automattic, makers of WordPress.com, volunteered to move my blog there from Bluehost, free of charge. If all goes according to plan, you should notice faster loading times, less downtime, and hopefully nothing else different. Please let me know if you encounter any problems. And huge thanks to the WordPress.com Special Projects Team, especially Christopher Jones and Mark Drovdahl, for helping me out with this.

by Scott at October 21, 2021 09:35 PM UTC

### Postdoc Positions

from Luca Trevisan

The call is out for two postdoctoral positions at Bocconi to work in my group [application link]. If you are interested and you have any questions, feel free to email me (L.Trevisan at Unibocconi dot it)

The negotiable start date is September 1st, 2022. Each position is for one year, renewable for a second. The positions offer an internationally competitive salary (up to 65,000 Euro per year, tax-free, plus relocation assistance and travel allowance), in a wonderful location that, at long last, is back to more or less normal life. The application deadline is December 17, 2021.

Among the topics that I am interested in are spectral graph theory, average-case complexity, “applications” of semidefinite programming, random processes on networks, approximation algorithms, pseudorandomness and combinatorial constructions.

Bocconi Computer Science is building up a theory group: besides me, we have Alon Rosen, Marek Elias, a tenured person that will join next Fall, and more hires are on the horizon. Now that traveling is ok again, and considering that Alon and I both have ERC grants, we should expect a big stream of theory visitors coming and going through Bocconi from week-long visits to semester or year long sabbaticals.

### postdoc at IRIF, DIENS, LIP6 (Paris, France) (apply by November 1, 2021)

from CCI: jobs

We are looking to recruit 1 to 3 postdocs for 1 to 3 years to work on the algorithmic theory of new data models (Theory, Algorithms), at IRIF (Université de Paris), LIP6 (Sorbonne Université), and DIENS (Université PSL) in Paris, France, with three years of funding by the ANR grant Algoridam (https://www.irif.fr/~algoridam/) and a flexible starting date.

Website: https://www.irif.fr/postes/postdoc
Email: postdoc.algoridam@listes.irif.fr.

by shacharlovett at October 20, 2021 10:50 PM UTC

### Opportunities at Harvard!

Computer Science at Harvard, and in particular theoretical computer science and machine learning, is growing fast, see my 21-Tweet thread:

Please consider applying for graduate studies in computer science (or encourage others to apply if like me, your grad-school days are behind you).

In recent years, I’ve taken a special interest in the theory of machine learning, and there are several opportunities at Harvard in this area. In particular, I’ve been collaborating with people both in and outside CS, including incoming CS faculty Sham Kakade, as well as Demba Ba (Electrical Engineering and Bioengineering), Lucas Janson (Statistics), and  Cengiz Pehlevan (Applied Mathematics). I’m very much open to co-advising students outside computer science as well.

Students interested in quantum computation and information can apply to any of our programs (including computer science, physics, and others) as well as to the new quantum science and engineering degree. There are a number of Harvard faculty interested in quantum computing including new incoming faculty member Anurag Anshu. In this area as well I am open to co-advising, including students outside CS.

If you are applying to graduate studies and are potentially interested in working with me, you can indicate this in the application form and statement of interest. This is the best way to ensure that I read your application (and in particular better than emailing me separately: for fairness sake I read all the applications together in batch, regardless of whether the candidate emailed me or not).

We are also looking for postdocs! We have the general Rabin postdoc position, as well as specific positions in privacy, fairness, and theory of machine learning, all described in the following ad. There is also a separate networking+theory postdoc position. Several other machine-learning theory related postdocs are linked in our ML theory opportunities page. See also the quantum initiative postdoctoral fellowship. There may be more CS, ML and quantum related opportunities that I am missing. If I hear of more relevant opportunities then I will post them here and/or on Twitter.

Finally, Harvard welcomes applications from candidates of all backgrounds, regardless of disability, gender identity and expression, physical appearance, race, religion, or sexual orientation. I would like to especially encourage applications from members of under-represented groups. Computer Science, and in particular the areas I work in, have a significant diversity problem. We are trying at Harvard to create an environment that is welcoming and inclusive for people from all backgrounds. I am sincerely grateful and appreciative of people placing their trust in us by applying, and we will do our best to live up to that trust.

by Boaz Barak at October 20, 2021 08:01 PM UTC

### Networking+Theory Postdoc at Harvard

The last couple of years one aspect of research I've greatly enjoyed is getting back into networking, which is really due to my excellent (and patient) collaborators Ran Ben Basat (now at UCL, was a postdoc at Harvard) and Minlan Yu (at Harvard).  Minlan and I are working to establish a larger-scale Networking+Theory (hopefully broadening to an even larger Systems+Theory) group at Harvard, working on algorithmic problems in the context of real (or at least very real-ish) systems.  We have funding, and are looking for a postdoc, the basic description is below.  Ideally we're looking for people comfortable with the theory side and the systems side.  The website link for applying is https://academicpositions.harvard.edu/postings/10748 .  We have preliminary website for the group at https://projects.iq.harvard.edu/theosys (it's just a start, Minlan and I are both on sabbatical, but you can see some of our publications).  We look forward to finding another member of the team!

Networking + Theory Postdoctoral Position

The John A. Paulson School of Engineering and Applied Sciences at Harvard University (SEAS) seeks applicants for a postdoctoral position in networking and theory. The postdoc is intended for one year but there will be funding to potentially extend it to a second year. The postdoc will receive a generous salary as well as an allocation for research and travel expenses.

We are looking for junior scientists who are especially interested in working at the intersection of networking and algorithmic theory, in areas such as programmable network architectures, data center network management, cloud computing, and algorithms for the Internet.  Example topics of interest include but are not limited to the design and analysis of sketches and filters for use in real systems, network security, network compression methods, and optimizing network performance for machine learning applications.  The ideal candidate will be interested in both building real systems and either developing algorithms and data structures or using existing, underutilized results from the theoretical literature in system design.

The postdoc is intended to work with closely with Minlan Yu and Michael Mitzenmacher, and others involved in the group focused on Systems + Theory work that they are developing, as well as possibly other Harvard faculty.

Candidates should have backgrounds in networking and/or theoretical computer science.  Candidates should demonstrate experience in working at the intersection of these areas, or otherwise demonstrate how they will be able to contribute at the intersection. The candidate will be expected to publish scholarly papers, attend internal, domestic, and international conferences and meetings, and take on a mentorship role for undergraduate and graduate students.

Harvard SEAS is dedicated to building a diverse community that is welcoming for everyone, regardless of disability, gender identity and expression, physical appearance, race, religion, or sexual orientation.  We strongly encourage applications from members of underrepresented groups.

by Michael Mitzenmacher (noreply@blogger.com) at October 20, 2021 06:13 PM UTC

### Online Optimization Post 6: The Impagliazzo Hard-Core Set Lemma

from Luca Trevisan

(This is the sixth in a series of posts on online optimization techniques and their “applications” to complexity theory, combinatorics and pseudorandomness. The plan for this series of posts is to alternate one post explaining a result from the theory of online convex optimization and one post explaining an “application.” The first two posts were about the technique of multiplicative weight updates and its application to “derandomizing” probabilistic arguments based on combining a Chernoff bound and a union bound. The third and fourth post were about the Follow-the-Regularized-Leader framework, and how it unifies multiplicative weights and gradient descent, and a “gradient descent view” of the Frieze-Kannan Weak Regularity Lemma. The fifth post was about the constrained version of the Follow-the-Regularized-Leader framework, and today we shall see how to apply that to a proof of the Impagliazzo Hard-Core Lemma.)

1. The Impagliazzo Hard-Core Lemma

The Impagliazzo Hard-Core Lemma is a striking result in the theory of average-case complexity. Roughly speaking, it says that if ${g: \{ 0,1 \}^n \rightarrow \{ 0,1 \}}$ is a function that is “weakly” hard on average for a class ${\cal F}$ of “efficiently computable” functions ${f}$, that is, if, for some ${\delta>0}$, we have that

$\displaystyle \forall f \in {\cal F}: \ \ \Pr_{x\sim \{ 0,1\}^n} [f(x) = g(x) ] \leq 1 -\delta$

then there is a subset ${H\subseteq \{ 0,1 \}^n}$ of cardinality ${\geq 2\delta 2^n}$ such that ${g}$ is “strongly” hard-on-average on ${H}$, meaning that

$\displaystyle \forall f \in {\cal F}: \ \ \Pr_{x\sim H} [f(x) = g(x) ] \leq \frac 12 + \epsilon$

for a small ${\epsilon >0}$. Thus, the reason why functions from ${\cal F}$ make a mistake in predicting ${g}$ at least a ${\delta}$ fraction of the times is that there is a “hard-core” set ${H}$ of inputs such that every function from ${\cal F}$ makes a mistake about 1/2 of the times for the ${2\delta}$ fraction of inputs coming from ${H}$.

The result is actually not literally true as stated above, and it is useful to understand a counterexample, in order to motivate the correct statement. Suppose that ${\cal F}$ contains just ${1/\delta}$ functions, and that each function ${f\in \cal F}$ differs from ${g}$ in exactly a ${\delta}$ fraction of inputs from ${\{ 0,1 \}^n}$, and that the set of mistakes are disjoint. Thus, for every set ${H\subseteq \{ 0,1 \}^n}$, no matter its size, there is a function ${f\in \cal F}$ that agrees with ${g}$ on at least a ${1-\delta}$ fraction of inputs from ${H}$. The reason is that the sets of inputs on which the functions of ${\cal F}$ differ from ${g}$ form a partition of ${\{ 0,1 \}^n}$, and so their intersections with ${H}$ form a partition of ${H}$. By an averaging argument, one of those intersections must then contain at most ${\delta |H|}$ elements of ${H}$.

In the above example, however, if we choose any three distinct functions ${f_1,f_2,f_3}$ from ${\cal F}$, we have

$\displaystyle \forall x\in \{ 0,1 \}^n: \ \ \ g(x) = {\rm majority} (f_1(x), f_2(x),f_3(x))$

So, although ${g}$ is weakly hard on average with respect to ${\cal F}$, we have that ${g}$ is not even worst-case hard for a slight extension of ${\cal F}$ in which we allow functions obtained by simple compositions of a small number of functions of ${\cal F}$.

Theorem 1 (Impagliazzo Hard-Core Lemma) Let ${\cal F}$ be a collection of functions ${f: \{ 0,1 \}^n \rightarrow \{ 0,1 \}}$, let ${g: \{ 0,1 \}^n \rightarrow \{ 0,1 \}}$ a function, and let ${\epsilon>0}$ and ${\delta >0}$ be positive reals. Then at least one of the following conditions is true:

• (${g}$ is not weakly hard-on-average over ${\{ 0,1 \}^n}$ with respect to a slight extension of ${\cal F}$) There is a ${k= O(\epsilon^{-2} \log \delta^{-1} )}$, an integer ${b}$, and ${k}$ functions ${f_1,\ldots,f_k \in \cal F}$, such that

$\displaystyle h(x) := I \{ f_1(x) + \ldots + f_k(x)\geq b \}$

satisfies

$\displaystyle \Pr_{x\in \{ 0,1 \}^n} [ g(x) = h(x) ] \geq 1-\delta$

• (${g}$ is strongly hard-on-average over a set ${H}$ of density ${2\delta}$) There is a set ${H\subseteq \{ 0,1 \}^n}$ such that ${H \geq 2\delta \cdot 2^n}$ and

$\displaystyle \forall f\in {\cal F}: \ \ \Pr_{x\in H} [ g(x) = f(x) ] \leq \frac 12 + \epsilon$

Where ${I \{ {\rm boolean\ expression} \}}$ is equal to ${1}$ or ${0}$ depending on whether the boolean expression is true or false (the letter “${I}$” stands for “indicator” function of the truth of the expression).

2. Proving the Lemma

Impagliazzo’s proof had ${k}$ polynomial in both ${1/\epsilon}$ and ${1/\delta}$, and an alternative proof discovered by Nisan has a stronger bound on ${k}$ of the order of ${\epsilon^{-2} \log \epsilon^{-1} \delta^{-1}}$. The proofs of Impagliazzo and Nisan did not immediately give a set of size ${2\delta2^n}$ (the set had size ${\delta 2^n}$), although this could be achieved by iterating their argument. An idea of Holenstein allows to prove the above statement in a more direct way.

Today we will see how to obtain the Impagliazzo Hard-Core Lemma from online optimization, as done by Barak, Hardt and Kale. Their proof achieves all the parameters claimed above, once combined with Holenstein’s ideas.

We say that a distribution ${M}$ (here “${M}$” stands for probability measure; we use this letter since we have already used ${D}$ last time to denote the Bregman divergence) has min-entropy at least ${K}$ if, for every ${x}$, ${M(x) \leq 2^{-K}}$. In other words, the min-entropy of a distribution ${M}$ over a sample space ${X}$ is defined as

$\displaystyle H_{\infty} (M) := \min_{x\in X} \log_2 \frac 1 {M(x)}$

The uniform distribution over a set ${H}$ has min-entropy ${\log_2 |H|}$, and all distributions of min-entropy ${K}$ can be realized as a convex combination of distributions that are each uniform over a set of size ${\geq K}$, thus uniform distributions over large sets and large-min-entropy distributions are closely-related concepts. We will prove the following version of the hard-core lemma:

Theorem 2 (Impagliazzo Hard-Core Lemma — Min-Entropy Version) Let ${X}$ be a finite set, ${\cal F}$ be a collection of functions ${f: X \rightarrow \{ 0,1 \}}$, let ${g: X \rightarrow \{ 0,1 \}}$ a function, and let ${\epsilon>0}$ and ${\delta >0}$ be positive reals. Then at least one of the following conditions is true:

• (${g}$ is not weakly hard-on-average over ${X}$ with respect to ${\cal F}$) There is a ${k= O(\epsilon^{-2} \log \delta^{-1} )}$, an integer ${b}$, and ${k}$ functions ${f_1,\ldots,f_k \in \cal F}$, such that

$\displaystyle h(x) := I \{ f_1(x) + \ldots + f_k(x)\geq b \}$

satisfies

$\displaystyle \Pr_{x\in X} [ g(x) = h(x) ] \geq 1-\delta$

• (${g}$ is strongly hard-on-average on a distribution of min-entropy ${\geq \log_2 2\delta |X|}$) There is a distribution ${H}$ of min-entropy ${\geq \log_2 2\delta|X|}$ such that

$\displaystyle \forall f\in {\cal F}: \ \ \Pr_{x\sim H} [ g(x) = f(x) ] \leq \frac 12 + \epsilon$

Under minimal assumptions on ${\cal F}$ (that it contains ${<< 2^{|X|}}$ functions), the min-entropy version implies the set version, and the min-entropy version can be used as-is to derive most of the interesting consequences of the set version.

Let us restate it one more time.

Theorem 3 (Impagliazzo Hard-Core Lemma — Min-Entropy Version) Let ${X}$ be a finite set, ${\cal F}$ be a collection of functions ${f: X \rightarrow \{ 0,1 \}}$, let ${g: X \rightarrow \{ 0,1 \}}$ a function, and let ${\epsilon>0}$ and ${\delta >0}$ be positive reals. Suppose that for every distribution ${H}$ of min-entropy ${\geq \log_2 2\delta|X|}$ we have

$\displaystyle \forall f\in {\cal F}: \ \ \Pr_{x\sim H} [ g(x) = f(x) ] > \frac 12 + \epsilon$

Then there is a ${k= O(\epsilon^{-2} \log \delta^{-1} )}$, an integer ${b}$, and ${k}$ functions ${f_1,\ldots,f_k \in \cal F}$, such that

$\displaystyle h(x) := I \{ f_1(x) + \ldots + f_k(x)\geq b \}$

satisfies

$\displaystyle \Pr_{x\in X} [ g(x) = h(x) ] \geq 1-\delta$

As in previous posts, we are going to think about a game between a “builder” that works toward the construction of ${h}$ and an “inspector” that looks for defects in the construction. More specifically, at every round ${i = 1,\ldots,T}$, the inspector is going to pick a distribution ${M_i}$ of min-entropy ${\geq \log_2 2\delta|X|}$ and the builder is going to pick a function ${f_i\in {\cal F}}$. The loss function, that the inspector wants to minimize, is

$\displaystyle L_i (M) := \mathop{\mathbb P}_{x\sim M} [f_i (x) = g(x)]$

The inspector runs the agile online mirror descent algorithm with the constraint of picking distributions of the required min-entropy, and using the entropy regularizer; the builder always chooses a function ${f_i}$ such that that

$\displaystyle L_i (M) := \mathop{\mathbb P}_{x\sim M} [f_i (x) = g(x)] > \frac 12 + \epsilon$

which is always a possible choice given the assumptions of our theorem.

Just by plugging the above setting into the analysis from the previous post, we get that if we play this online game for ${T = O(\epsilon^{-2} \log \delta^{-1})}$ steps, the builder picks functions ${f_1,\ldots,f_T}$ such that, for every distribution ${H}$ of min-entropy ${\geq \log 2\delta |X|}$, we have

$\displaystyle \Pr_{x\sim H, \ i \sim \{ 1,\ldots,T \} } [ f_i(x) = g(x) ] > \frac 12 \ \ \ \ \ (1)$

We will prove that (1) holds in the next section, but we emphasize again that it is just a matter of mechanically using the analysis from the previous post. Impagliazzo’s proof relies, basically, on playing the game using lazy mirror descent with ${\ell_2^2}$ regularization, and he obtains a guarantee like the one above after ${T=O(\epsilon^{-2} \delta^{-1})}$ steps.

What do we do with (1)? Impagliazzo’s original reasoning was to define

$\displaystyle h(x) := majority (f_1(x),\ldots,f_T(x))$

and to consider the set ${B}$ of “bad” inputs ${x}$ such that ${h(x) \neq g(x)}$. We have

$\displaystyle \forall x \in B \ \ \Pr_{i \sim \{ 1,\ldots,T \} } [ f_i(x) = g(x) ] \leq \frac 12$

and so

$\displaystyle \Pr_{x\sim B, \ i \sim \{ 1,\ldots,T \} } [ f_i(x) = g(x) ] \leq \frac 12$

The min-entropy of the uniform distribution over ${B}$ is ${\log_2 |B|}$, and this needs to be less than ${\log_2 2\delta |X|}$, so we conclude that ${h(x) \neq g(x)}$ happens for at most a ${2\delta}$ fraction of elements of ${X}$.

This is qualitatively what we promised, but it is off by a factor of 2 from what we stated above. The factor of 2 comes from a subsequent idea of Holenstein. In Holenstein’s analysis, we sort elements of ${X}$ according to

$\displaystyle \Pr_{i \sim \{ 1,\ldots, T \}} [ f_i(x) = g_i (x) ]$

and he lets ${B}$ be the set of ${2\delta |X|}$ elements of ${X}$ for which the above quantity is smallest, and he shows that if we properly pick an integer ${b}$ and define

$\displaystyle h(x) := I\{ f_1(x) + \cdots + f_T(x) \geq b \}$

then ${h(x)}$ will be equal to ${g(x)}$ for all ${x\not\in B}$ and also for at least half the ${x\in B}$, meaning that ${h(x) = g(x)}$ for at least a ${1-\delta}$ fraction of the input. Since this is a bit outside the scope of this series of posts, we will not give an exposition of Holenstein’s argument.

3. Analysis of the Online Game

It remains to show that we can achieve (1) with ${T}$ of the order of ${\frac 1 {\epsilon^2} \log \frac 1 {\delta}}$. As we said, we play a game in which, at every step ${i=1,\ldots,T}$

• The “inspector” player picks a distribution ${M_i}$ of min-entropy at least ${\log_2 2\delta |X|}$, that is, it picks a number ${\frac 1 {2\delta |X|} \geq M_i(x)\geq 0}$ for each ${x\in X}$ such that ${\sum_x M_i(x) = 1}$.
• The “builder” player picks a function ${f_i \in {\cal F}}$, whose existence is guaranteed by the assumption of the theorem, such that

$\displaystyle \Pr_{x\sim M_i} [f_i(x) = g(x) ] \geq \frac 12 +\epsilon$

and defines the loss function

$\displaystyle L_i(M) := \Pr_{x\sim M_i} [f_i(x) = g(x) ] = \sum_{x\in X} M(x) \cdot I\{ f_i(x) = g(x) \}$

• The “inspector” is charged the loss ${L_i(M_i)}$.

We analyze what happens if the inspector plays the strategy defined by agile mirror descent with negative entropy regularizer. Namely, we define the regularizer

$\displaystyle R(M) = c \sum_x M(x) \log M(x)$

for a choice of ${c}$ that we will fix later. The corresponding Bregman divergence is

$\displaystyle D(M,\hat M) = c KL(M,\hat M) = c \cdot \left( \sum_x M(x) \log \frac {M(x)}{\hat M(x)} - \sum_x M(x) + \sum_x \hat M(x) \right)$

and we work over the space ${\cal K}$ of distributions of min-entropy ${\geq \log_2 2\delta |X|}$.

The agile online mirror descent algorithm is

$\displaystyle M_1 = \arg\min_{M\in {\cal K}} R(M)$

so that ${M_1}$ is the uniform distribution, and for ${i\geq 1}$

$\displaystyle \hat M_{i+1} = \arg\min_{M: X \rightarrow {\mathbb R}} \ \ D(M_i,M) + L_i (M)$

$\displaystyle M_{i+1} = \arg\min_{M \in {\cal K}} \ \ D(M, \hat M_{i+1} )$

Solving the first step of agile online mirror descent, we have

$\displaystyle \hat M_{i+1} (x) = M_i(x) e^{-\frac 1c I \{ f(x) = g(x) \} }$

Using the analysis from the previous post, for every distribution ${M}$ in ${\cal K}$, and every number ${T}$ of steps, we have the regret bound

$\displaystyle \sum_{i=1}^T L_i(M_i) - L_i(M) \leq D(M,M_1) + \sum_{i=1}^T D(M_i, \hat M_{i+1} )$

and we can bound

$\displaystyle D(M,M_1) = c \sum_x M(x) \cdot \ln |X| \cdot M(x) \leq c \ln \frac 1 {2\delta}$

and

$\displaystyle D(M_i,\hat M_{i+1}) = c\cdot \left( \sum_x M_i(x) \ln \frac{M_i(x)}{\hat M_{i+1} (x) } + \sum_x \hat M_{i+1}(x) - M_i(x) \right)$

$\displaystyle = c \cdot \sum_x M_i(x) \cdot \left( \frac 1c I\{ f(x) = g(x) \} + e^{-\frac 1 cI \{ f(x) = g(x)\}} -1 \right)$

$\displaystyle \leq O \left( \frac 1c \right)$

where, in the last step, we used the fact the quantity in parenthesis is either 0 or ${1/c + e^{-1/c} - 1}$ which is ${O(1/c^2)}$, and that ${\sum_x M_i(x) = 1}$ because ${M_i}$ is a distribution.

Overall, the regret is bounded by

$\displaystyle \sum_{i=1}^T L_i(M_i) - L_i(M) \leq O \left( c\log \frac 1\delta + \frac Tc \right) \leq O \left( \sqrt{T \log \frac 1\delta}\right)$

where the last inequality comes from an optimized choice of ${c}$.

Recall that we choose the functions ${f_i}$ so that ${L_i(M_i) \geq 1/2 + \epsilon}$ for every ${i}$, so for every ${M\in {\cal K}}$

$\displaystyle \frac 1T \sum_{i=1}^T L_i (M) \geq \frac 12 + \epsilon - O (\left( \sqrt{\frac 1 T \log \frac 1\delta}\right)$

and by choosing ${T}$ of the order of ${\frac 1 {\epsilon^2} \log \frac 1 \delta}$ we get

$\displaystyle \forall M \in {\cal K} : \ \ \ \frac 1T \sum_{i=1}^T L_i (M) > \frac 12$

It remains to observe that

$\displaystyle \frac 1T \sum_{i=1}^T L_i (M) = \frac 1T \sum_{i=1}^T \Pr_{x\sim M} [f_i(x) = g(x) ] = \Pr_{i \sim \{1,\ldots,T\}, \ x\sim M} [ f_i (x) = g(x) ]$

so we have that for every distribution ${M}$ of min-entropy at least ${\log_2 2\delta |X|}$ it holds that

$\displaystyle \Pr_{i \sim \{1,\ldots,T\}, \ x\sim M} [ f_i (x) = g(x) ]> \frac 12$

which is the statement that we promised and from which the Impagliazzo Hard-Core Lemma follows.

4. Some Final Remarks

After Impagliazzo circulated a preliminary version of his paper, Nisan had the following idea: consider the game that we define above, in which a builder picks an ${f\in {\cal F}}$, an inspector picks a distribution ${M \in {\cal K}}$ of the prescribed min-entropy, and the loss for the inspector is given by ${\Pr [ f(x) = g(x) ]}$. We can think of it as a zero-sum game if we also assign a gain ${\Pr [ f(x) = g(x)]}$ to the builder.

If the builder plays second, there is a strategy that guarantees a gain that is at least ${1/2 + \epsilon}$, and so there must be a mixed strategy, that is, a distribution ${{\cal DF}}$ over functions in ${\cal F}$, that guarantees such a gain even if the builder plays first. In other words, for all distributions ${H}$ of the prescribed min-entropy we have

$\displaystyle \Pr_{x\sim H, f\sim {\cal DF}} [ f(x) = g(x) ] \geq \frac 12 + \epsilon$

Nisan then observes that we can sample ${T = \frac 1{\epsilon^2} \log |X|}$ functions ${f_1,\ldots,f_T}$ and have, with high probability

$\displaystyle \Pr_{x\sim H, i\sim \{1,\ldots,T\}} [ f_i(x) = g(x) ] > \frac 12$

and the sampling bound on ${T}$ can be improved to order of ${\frac 1 {\epsilon^2} \log \frac 1{\epsilon \delta}}$ with the same conclusion.

Basically, what we have been doing today is to come up with an algorithm that finds an approximate solution for the LP that defines the optimal mixed strategy for the game, and to design the algorithm is such a way that the solution is very sparse.

This is a common feature of other applications of online optimization techniques to find “sparse approximations”: one sets up an optimization problem whose objective function measures the “approximation error” of a given solution. The object we want to approximate is the optimum of the optimization problem, and we use variants of mirror descent to prove the existence of a sparse solution that is a good approximation.

### Department Chair, Computer Science and Engineering at NYU Tandon (apply by December 1, 2021)

from CCI: jobs

NYU’s Dept. of Computer Science and Engineering within the Tandon School of Engineering is hiring a department chair. The position is open to all research areas, but applications from theory candidates are very welcome. Our department has a growing Algorithms and Foundations Group (csefoundations.engineering.nyu.edu), with several recent hires working on theory of algorithms and machine learning.

Website: https://apply.interfolio.com/96495
Email: cmusco@nyu.edu

by shacharlovett at October 20, 2021 03:56 PM UTC

### Assistant Professor at Georgetown University (apply by December 15, 2021)

from CCI: jobs

Provost’s Distinguished Faculty Fellow and Assistant Professor of Computer Science position available at Georgetown University. For information on how to apply, please visit https://cs.georgetown.edu/jobs/

Website: http://apply.interfolio.com/96143
Email: nitin.vaidya@georgetown.edu

by shacharlovett at October 20, 2021 02:45 PM UTC

### The Saga of Highleyman's Data.

from Ben Recht

The first machine learning benchmark dates back to the late 1950s. Few used it and even fewer still remembered it by the time benchmarks became widely used in machine learning in the late 1980s.

In 1959 at Bell Labs, Bill Highleyman and Louis Kamenstky designed a scanner to evaluate character recognition techniques. Their goal was “to facilitate a systematic study of character-recognition techniques and an evaluation of methods prior to actual machine development.” It was not clear at the time which part of the computations should be done in special purpose hardware and which parts should be done with more general computers. Highleyman later patented an OCR scheme that we recognize today as a convolutional neural network with convolutions optically computed as part of the scanning.

Highleyman and Kamentsky used their scanner to create a dataset of 1800 alphanumeric characters. They gathered the 26 letters of the alphabet and 10 digits from 50 different writers. Each character in their corpus was scanned in binary at a resolution of 12 x 12 and stored on punch cards that were compatible with the IBM 704, the GPGPU of the era.

With the data in hand, Highleyman and Kamenstky began studying various proposed techniques for recognition. In particular, they analyzed a method of Woody Bledsoe and published an analysis claiming to be unable to reproduce Bledsoe’s results. Bledsoe found their numbers to be considerably lower than he had expected, and asked Highleyman to send him the data. Highleyman obliged, mailing the package of punch cards across the country to Sandia Labs.

Upon receiving the data, Bledsoe conducted a new experiment. In what may be the first implementation of a train-test split, he divided the characters up, using 40 writers for training and 10 for testing. By tuning the hyperparameters, Bledsoe was able to achieve approximately 60% error. Bledsoe also suggested that the high error rates were to be expected as Highleyman’s data was too small. Prophetically, he declared that 1000 alphabets might be needed for good performance.

By this point, Highleyman had also shared his data with Chao Kong “C.K.” Chow at the Burroughs Corporation (a precursor to Unisys). A pioneer of using decision theory for pattern recognition, Chow built a pattern recognition system for characters. Using the same train-test split as Bledsoe, Chow obtained an error rate of 41.7% using a convolutional neural network.

Highleyman made at least six additional copies of the data he had sent to Bledsoe and Chow, and many researchers remained interested. He thus decided to publicly offer to send a copy to anyone willing to pay for the duplication and shipping fees. An interested party would simply have to mail him a request. Of course, the dataset was sent by US Postal Service. Electronic transfer didn’t exist at the time, resulting in sluggish data transfer rates on the order of a few bits per minute.

Highleyman not only created the first machine learning benchmark. He authored the the first formal study of train-test splits and proposed empirical risk minimization for pattern classification as part of his 1961 dissertation. By 1963, however, Highleyman had left his research position at Bell Labs and abandoned pattern recognition research.

We don’t know how many people requested Highleyman’s data. The total number of copies may have been less than twenty. Based on citation surveys, we determined there were at least another six copies made after Highleyman’s public offer for duplication, sent to CMU, Honeywell, SUNY Stony Brook, Imperial College, UW Madison, and Stanford Research Institute (SRI).

The SRI team of John Munson, Richard Duda, and Peter Hart performed some of the most extensive experiments with Highleyman’s data. A 1-nearest-neighbors baseline achieved an error rate of 47.5%. With a more sophisticated approach, they were able to do significantly better. They used a multi-class, piecewise linear model, trained using Kesler’s multi-class version of the perceptron algorithm (what we’d now call “one-versus all classification”). Their feature vectors were 84 simple pooled edge detectors in different regions of the image at different orientations. With these features, they were able to get a test error of 31.7%, 10 percentage points better than Chow. When restricted only to digits, this method recorded 12% error. The authors concluded that they needed more data, and that the error rates were “still far too high to be practical.” They concluded that “larger and higher-quality datasets are needed for work aimed at achieving useful results.” They suggested that such datasets “may contain hundreds, or even thousands, of samples in each class.”

Munson, Duda, and Hart also performed informal experiments with humans to gauge the readability of Highleyman’s characters. On the full set of alphanumeric characters, they found an average error rate of 15.7%, about 2x better than their pattern recognition machine. But this rate was still quite high and suggested the data needed to be of higher quality. They (again prophetically) concluded that “an array size of at least 20X20 is needed, with an optimum size of perhaps 30X30.”

Decades passed until such a dataset appeared. Thirty years later, with 125 times as much training data, 28x28 resolution, and with grayscale scans, a neural net achieved 0.7% test error on the MNIST digit recognition task. In fact, a similar model to Munson’s architecture consisting of kernel ridge regression trained on pooled edged detectors also achieves 0.6% error. Intuition from the 1960s proved right. The resolution was higher and the number of examples per digit was now in the thousands, just as Bledsoe, Munson, Duda, and Hart predicted would be sufficient. Reasoning heuristically that the test error should be inversely proportional to the square root of the number of training examples, we would expect an 11x improvement over Munson’s approaches. The actual recorded improvement from 12% to 0.7% was closer to 17x, not far from what the back of the envelope calculation predicts.

Unlike Highleyman’s data, MNIST featured only digits, no letters. Only recently, in 2017, researchers from Western Sydney University extracted alphanumeric characters from the NIST-19 repository. The resulting EMNIST_Balanced dataset has 2400 examples in each of the 47 classes, with a class for all upper case letters, all digits, and some of the non-ambiguous lower case letters. Currently, the best performing model achieves a test error rate of 9.4%. While the dataset is still fairly new, this is only a 3x improvement over the methods of Munson, Duda, and Hart. Applying the same naive scaling argument as above, the increase in dataset size would predict a 7x improvement if such an improvement was achievable. Considering that the SRI team observed a human-error rate of 11% on Highleyman’s data, it is quite possible that an accuracy of 90% is close to the best that we can expect for recognizing handwritten digits without context.

The story of Highleyman’s data foreshadows many of the later waves of machine learning research. A desire for better evaluation inspired the creation of novel data. Dissemination of the experimental results on this data led to sharing in order for researchers to be content that the evaluation was fair. Once the dataset was distributed, others requested the data to prove their methods were superior. And then the dataset itself became enshrined as a benchmark for competitive testing. Such comparative testing led to innovations in methods, theory, and data collection and curation itself. We have seen this pattern time and time again in machine learning, from the UCI repository, to MNIST, to ImageNet, to CASP. The nearly forgotten history of Highleyman’s data marks the beginning of this pattern recognition research paradigm.

We are, as always, deeply indebted to Chris Wiggins for sending us Munson et al.’s paper after watching a talk by BR on the history of ML benchmarking. We also thank Ludwig Schmidt for pointing us to EMNIST.

## Addendum on our protagonist Bill Highleyman.

After posting this blog, we found some lovely recollections by Bill Highleyman about his thesis. It is remarkable how Bill invented so many powerful machine learning primitives—finding linear functions that minimize empirical risk, gradient descent to minimize the risk, train-test splits, convolutional neural networks—all as part of his PhD dissertation project. That said, Bill considered the project to be a failure. He (and Bell Labs) realized the computing of 1959 was not up to the task of character recognition.

After he finished his thesis, Bill abandoned pattern recognition and moved on to work on other cool and practical computer engineering projects that interested him, never once looking back. By the mid sixties Bill had immersed himself in data communication and transmission, and patented novel approaches to electrolytic printing and financial transaction hardware. He eventually ended up specializing in high-reliability computing. Though he developed many of the machine learning techniques we use today, he was content to leave the field and work to advance general computing to catch up with his early ideas.

It’s odd but not surprising that while every machine learning class mentions Rosenblatt, Minsky, and Papert, almost everyone we’ve spoken with so far has never heard of Bill Highleyman.

We worry Bill is no longer reachable as he seems to have no online presence after 2019 and would be 88 years old today. If anyone out there on has met Bill, we’d love to hear more about him. Please drop us a note.

And if anyone has any idea of where we can get a copy of his 1800 characters from 1959, please let us know about that too…

### TR21-146 | Sample-Based Proofs of Proximity | Guy Goldberg, Guy Rothblum

from ECCC papers

Suppose we have random sampling access to a huge object, such as a graph or a database. Namely, we can observe the values of \emph{random} locations in the object, say random records in the database or random edges in the graph. We cannot, however, query locations of our choice. Can we verify complex properties of the object using only this restricted sampling access? In this work, we initiate the study of \emph{sample-based} proof systems, where the verifier is extremely constrained; Given an input, the verifier can only obtain samples of uniformly random and i.i.d. locations in the input string, together with the values at those locations. The goal is verifying complex properties in sublinear time, using only this restricted access. Following the literature on Property Testing and on Interactive Proofs of Proximity (IPPs), we seek proof systems where the verifier accepts every input that has the property, and with high probability rejects every input that is \emph{far} from the property. We study both interactive and non-interactive sample-based proof systems, showing: - On the positive side, our main result is that rich families of properties / languages have sub-linear sample-based interactive proofs of proximity (SIPPs). We show that every language in $\mathcal{NC}$ has a SIPP, where the sample and communication complexities, as well as the verifier's running time, are $\widetilde{O}(\sqrt{n})$, and with polylog(n) communication rounds. We also show that every language that can be computed in polynomial-time and bounded-polynomial space has a SIPP, where the sample and communication complexities of the protocol, as well as the verifier's running time are roughly $\sqrt{n}$, and with a constant number of rounds. This is achieved by constructing a reduction protocol from SIPPs to IPPs. With the aid of an untrusted prover, this reduction enables a restricted, sample-based verifier to simulate an execution of a (query-based) IPP, even though it cannot query the input. Applying the reduction to known query-based IPPs yields SIPPs for the families described above. - We show that every language with an adequate (query-based) property tester has a 1-round SIPP with \emph{constant} sample complexity and logarithmic communication complexity. One such language is equality testing, for which we give an explicit and simple SIPP. - On the negative side, we show that \emph{interaction} can be essential: we prove that there is no \emph{non}-interactive sample-based proof of proximity for equality testing. - Finally, we prove that \emph{private coins} can dramatically increase the power of SIPPs. We show a strong separation between the power of public-coin SIPPs and private-coin SIPPs for Equality Testing.

### Some Statistical Gamma Fun

from Richard Lipton

Nothing takes place in the world whose meaning is not that of some maximum or minimum. — Leonhard Euler

Kasper Müller is a mathematics and data science writer for Medium, where he contributes primarily to the blogs Cantor’s Paradise and Towards Data Science. He wrote a nice article last April titled, “The Beautiful Gamma Function and the Genius Who Discovered It.”

Today we discuss the relevance of the gamma function to statistics and use statistics to suggest a new kind of estimate for it.

The “Genius” that Müller refers to is Leonhard Euler. Euler proved that for all integers ${n \geq 0}$,

$\displaystyle n! = \int_0^1 (-\ln s)^n ds = \int_0^\infty t^n e^{-t} dt,$

where the latter equation uses the substitution ${s = e^{-t}}$. The right-hand side produces a value for any complex number ${z = x + iy}$ in place of ${n}$ provided ${x > -1}$. This leads to the formal definition

$\displaystyle \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} dt,$

whose analytic extension is defined everywhere except for ${z = 0, -1, -2, -3,\dots}$. Because ${\Gamma(z)}$ has no zeroes, its reciprocal is an entire function. One neat value is ${\Gamma(\frac{1}{2}) = \sqrt{\pi}}$. We will be mainly concerned with ratios of two values of ${\Gamma}$.

## What is Gamma For?

For all ${z}$ except the non-positive integers, ${\Gamma}$ obeys the formula

$\displaystyle \frac{\Gamma(z+1)}{\Gamma(z)} = z.$

Of course, this follows from ${\Gamma(z) = (z-1)!}$ for positive integers ${z}$. Also

$\displaystyle \frac{\Gamma(z+2)}{\Gamma(z)} = \frac{\Gamma(z+2)}{\Gamma(z+1)}\cdot\frac{\Gamma(z+1)}{\Gamma(z)} = (z+1)z.$

In general, for all ${a > 0}$,

$\displaystyle \frac{\Gamma(z+a)}{\Gamma(z)} \sim z^a \ \ \ \ \ (1)$

but there is a discrepancy. This and the lack of a simple explicit formula for ${\Gamma(z)}$ at all have always made the ${\Gamma}$ function seem opaque to me. Two notable values are ${\Gamma(\frac{1}{2}) = \sqrt{\pi}}$ and ${\Gamma(\frac{3}{2}) = \frac{\sqrt{\pi}}{2}}$.

The ${\Gamma}$ function is not even the only uniformly continuous interpolation of the factorial function. It is the unique one whose logarithm is a convex function. This is the first of many reasons given in Müller’s article for ${\Gamma}$ to be salient and beautiful, culminating in its relation to the Riemann zeta function given by

$\displaystyle \frac{\Gamma(\frac{s}{2})\zeta(s)}{\pi^{s/2}} = \frac{\Gamma(\frac{1-s}{2})\zeta(1-s)}{\pi^{(1-s)/2}}.$

Yet the log-convex uniqueness was proved only 99 years ago, and none of these tell me at a flash what the ${\Gamma}$ function is.

What is the simplest label for its corner of the sky? The leading example is the formula for the volume of a sphere of radius ${r}$ in ${n}$ dimensions:

$\displaystyle V_n = \frac{\pi^{n/2}}{\Gamma(n + \frac{1}{2})}r^n.$

But I wonder whether a different application is more fundamental. Since we are dealing with ${a = \frac{1}{2}}$ already here, let us define the function

$\displaystyle \Gamma_{1/2}(z) = \frac{\Gamma(z+\frac{1}{2})}{\Gamma(z)}.$

Noting ${\Gamma_{1/2}(z) \sim z^{1/2}}$ via (1), this is a tweak of the square-root function. Here are some values of it:

$\displaystyle \begin{array}{rcl} \Gamma_{1/2}(1) &=& \frac{\Gamma(1.5)}{\Gamma(1)} = \frac{\sqrt{\pi}/2}{1} = \frac{\sqrt{\pi}}{2}\\ \Gamma_{1/2}(2) &=& \frac{\Gamma(2.5)}{\Gamma(2)} = \frac{3\sqrt{\pi}/4}{1} = \frac{3\sqrt{\pi}}{4}\\ \Gamma_{1/2}(3) &=& \frac{\Gamma(3.5)}{\Gamma(3)} = \frac{15\sqrt{\pi}/8}{2} = \frac{15\sqrt{\pi}}{16}\\ \Gamma_{1/2}(4) &=& \frac{\Gamma(4.5)}{\Gamma(4)} = \frac{105\sqrt{\pi}/16}{6} = \frac{35\sqrt{\pi}}{32}\\ \end{array}$

Here is the significance:

For integer ${n \geq 1}$, the expected Euclidean norm of a vector of ${n}$ independent samples from the standard Gaussian distribution is

$\displaystyle \sqrt{2}\cdot\Gamma_{1/2}(\frac{n}{2}). \ \ \ \ \ (2)$

That’s it: Gamma gives the norm of Gaussians. The norm is of order ${\sqrt{n}}$ but not exactly. The ${\Gamma}$ function gives it exactly.

## An Inferior But Curious Estimate

The norm of ${n}$ independent Gaussians is called the chi distribution. Its square is the better-known chi-squared distribution. This idea is used in the statistical chi-squared test, but what follows is simpler.

We let ${X^2}$ stand for the square norm divided by ${n}$, so that ${X}$ stands for the Euclidean norm divided by ${\sqrt{n}}$. From (2) we have

$\displaystyle E[X] = \sqrt{\frac{2}{n}}\Gamma_{1/2}(\frac{n}{2}).$

We will estimate ${E[X]}$ a different way and use that to estimate ${\Gamma_{1/2}}$. First we note that since the vector entries ${z_i}$ are independent and normally distributed, we have the exact values

$\displaystyle \begin{array}{rcl} E[X^2] &=& \frac{1}{n}\sum_{i=1}^n E[z_i^2] = 1\\ Var[X^2] &=& \frac{1}{n^2} \sum_{i=1}^n Var[z_i^2] = \frac{1}{n^2}\sum_{i=1}^n (E[z_i^4] - E[z_i]^2) = \frac{1}{n}(3 - 1) = \frac{2}{n}. \end{array}$

Since we have ${E[X^2]}$, computing either ${E[X]}$ or ${Var[X]}$ suffices to get the other, by the relation ${Var[X] = E[X^2] - E[X]^2}$. Our also having ${Var[X^2]}$ enables estimating ${Var[X]}$ via the delta method, in a particular form I noticed here. The derivation requires no special properties of ${X}$:

$\displaystyle Var[X^2] \approx 4E[X]^2 Var[X] - Var[X]^2. \ \ \ \ \ (3)$

For our particular ${X}$ with ${Var[X] = 1 - E[X]^2}$, this yields a quadratic equation in ${y = E[X]^2}$:

$\displaystyle \frac{2}{n} \doteq 4y(1 - y) - (1-y)^2, \quad\text{so}\quad 5y^2 - 6y + 1 + \frac{2}{n} = 0 \quad\text{so}\quad y = \frac{6 + \sqrt{16 - \frac{40}{n}}}{10}.$

This yields

$\displaystyle \frac{2}{n}\Gamma_{1/2}^2(\frac{n}{2}) \doteq \frac{3}{5} + \sqrt{\frac{4}{25} - \frac{2}{5n}}.$

Changing variables to ${z = \frac{n}{2}}$ and rearranging, we get the estimate

$\displaystyle \Gamma_{1/2}(z) \doteq \sqrt{\frac{3z}{5} + \sqrt{\frac{4z^2}{25} - \frac{z}{5}}}.$

It has been traditional to estimate what we would call ${\Gamma_{1/2}(z+\frac{1}{2})}$ instead, so putting ${x = z + \frac{1}{2}}$ we finally get:

$\displaystyle \frac{\Gamma(x+1)}{\Gamma(x+\frac{1}{2})} \sim \sqrt{0.6x + 0.3 + 0.2\sqrt{8x^2 - 2x - 1.5}} ~. \ \ \ \ \ (4)$

As an estimate, this is barely competitive with the simple ${\sqrt{x + 0.25}}$ and far inferior to

$\displaystyle (x^2 + 0.5x + 0.125)^{1/4},$

which is the first of several estimates of the form ${p_k(x)^{1/2k}}$ given by Cristinel Mortici in 2010. But it is curious that we got a formula with nested radicals and non-dyadic coefficients from a simple statistical estimate. It makes us wonder whether formulas with nested radicals can be tuned for greater accuracy, and whether this might knock back to statistical estimation.

## Open Problems

Can vectors of Gaussian variables be leveraged to say further interesting things about the gamma function and its applications? What are your favorite properties of the gamma function?

[fixed missing “ds” in intro, typo n–>sqrt(n) at end of sentence with “divided by”]

by KWRegan at October 19, 2021 08:21 PM UTC

### TR21-145 | Revisiting a Lower Bound on the Redundancy of Linear Batch Codes | Omar Alrabiah, Venkatesan Guruswami

from ECCC papers

A recent work of Li and Wootters (2021) shows a redundancy lower bound of $\Omega(\sqrt{Nk})$ for systematic linear $k$-batch codes of block length $N$ by looking at the $O(k)$ tensor power of the dual code. In this note, we present an alternate proof of their result via a linear independence argument on a collection of polynomials.

### Thursday Oct 21st — Maxim Raginsky from UIUC

The next Foundations of Data Science virtual talk will take place on Thursday, Oct 21st at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC).  Maxim Raginsky from University of Illinois, Urbana-Champaign will speak about “Neural SDEs: Deep Generative Models in the Diffusion Limit”

Please register here to join the virtual talk.

Abstract: In deep generative models, the latent variable is generated by a time-inhomogeneous Markov chain, where at each time step we pass the current state through a parametric nonlinear map, such as a feedforward neural net, and add a small independent Gaussian perturbation. In this talk, based on joint work with Belinda Tzen, I will discuss the diffusion limit of such models, where we increase the number of layers while sending the step size and the noise variance to zero. I will first provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control. Then I will show how we can quantify the expressiveness of diffusion-based generative models. Specifically, I will prove that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback-Leibler divergence to the target distribution. Finally, I will briefly discuss a scheme for unbiased, finite-variance simulation in such models. This scheme can be implemented as a deep generative model with a random number of layers.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

by dstheory at October 19, 2021 02:56 PM UTC

### Is MATH Ready for P=NP? Is Alexandra Fahrenthold Ready for P=NP?

(This post was inspired by Harry Lewis emailing me about his granddaughter.)

Harry Lewis's grand daughter Alexandra Fahrenthold (see both pictures) wants information
on how to claim the Millennial prize, so she will be ready.

This raises the question: How likely is it that Alexandra will resolve P vs NP (or perhaps some other Millennium problem if she wants to rebel against her grandfather)?

And more seriously:

1) Have we made progress on P vs NP? (I think not.)
(By we  I mean the community, not Harry and I or Harry and I and Alexandra,
for which the answer is a more definite NO.)

2) If not then why not?

3) How does this compare historically to other open problems in Math?

We will refer to progress made in solving an open problem, though that is a tricky notion since only after a problem is solved can you look back and say what was progress.  One might also count subcases (e.g., n=4 case of FLT) as progress even if they don't help lead to the final proof. I quote a letter from Harry Lewis to me upon reading a first draft of this post:
The one larger point I would suggest adding is to add my operational definition of progress: Progress is being made on a problem if, when the solution is published, it will cite work being published today. Of course that is “operational” only after the fact. Demillo Lipton Perlis at the end have a nice riff on this. The alchemists thought they were making progress on turning lead to gold but they weren’t, even though we know that was actually a solvable problem. Likewise jumping off of higher and higher buildings was not making progress toward heavier than air flight.

---------------------------------------------------------

1) Have we made progress on P vs NP?

a) I tell my students that we have made progress on ruling out certain techniques.
They laugh at that, at which point I decide to not tell them that my PhD thesis was about that sort of thing (oracles). I could say

Once you know what's not going to work you can concentrate one what is going to work.

But that sounds hollow since very few people are working on techniques that
might work (The Geometric Complexity Program, see here, is the only exception I know of.)

b) Are there any partial results? Ryan Williams showed that SAT (and also counting mod versions of it) cannot be done in time n^c and space n^{o(1)} where c is 2cos(2pi/7) (see here).  That is the only concrete lower bound on SAT that I know of.  Is it progress? Sam Buss and Ryan Williams later showed (see here) that, using current techniques, this cannot be improved. If that inspires new techniques that push it further, that would be great. So it is progress? Hard to know now.

c) There are some circuit lower bounds. One can debate if this is progress.
It will be a much better informed debate once the problem is solved.

So I would say VERY LITTLE PROGRESS.

------------------------------------------------
2) If not then why not?

a) It's only been open for 50 years. A drop in the mathematical bucket.
Counterargument: 50 years of 20th and 21st century mathematics is A LOT.

b) Sociology: The academic computer science conference-model induces us to get out a paper in time for the next conference deadline, and not think deeply about a problem.  Carl Smith thought that P vs NP would be solved by the son of a member of the communist party in the USSR (when there was a USSR) who did not have the pressure to get tenure and grants and such. He may be right.
Counterargument: there are some (1) mavericks who buck the system, and (2) people like Carl's son-of-a-party-member who are allowed to think deeply for years.

c) It's just dang hard! That's the real question. Paul Erdos said of the Collatz Conjecture:
Mathematics may not be ready for such problems.
Is that true of P vs NP as well?

----------------------------------
3) History and Philosophy.
(In college I once took the following four courses in one semester: History of Philosophy, Philosophy of History, Philosophy of Philosophy, History of History.)

Let's look at problems that were open and then solved:

a) The Three Greek Problems of Antiquity: Squaring the circle (given a circle, construct a square with the same area), doubling the cube (given a line that is the edge of cube, construct another line that is the edge of a cube with twice the volume), trisecting an angle (given an angle, construct two lines whose angle is 1/3 of the given angle), with a straightedge and compass. (When I first heard of this problem I wondered how knowing what direction was North would help trisect an angle.) Posed in roughly 400BC. Not clear what posed means in this context. Did the ask for a construction OR did they ask for EITHER a construction OR a proof that there wasn't one?

This might be the closest analogy to P vs NP: At the time the problem was stated
MATHEMATICS WAS NOT READY FOR SUCH PROBLEMS.
It took lots of new math, a better notation, and a different way of looking at numbers, to show that they  could not be done: Pierre Wantzel--doubling the cube (1837),Pierre Wantzel--trisection (1837), Lindemann-Weierstrass--squaring the circle (1882).
NOTE: Some sources list a fourth problem: constructing every regular polygon. Pierre Watnzel proved, in 1837, that a regular n-gon is constructible iff n is the product of a power of 2 and distinct Fermat  primes. (Why isn't Wantzel better known?)

b) Fermat's Last Theorem. Given its origin, not quite clear when it was posed but 1640's seems fair. This could not be solved when it was posed (On an episode of Dr. Who they claim that Fermat had a simple proof. Note that Dr. Who is fictional and their PhD (if they has one) is probably not in mathematics.)
MATHEMATICS WAS NOT READY FOR SUCH PROBLEMS
but not as much as the three Greek problems. Very steady progress on it, see  here. One of the real milestone was connecting it to other problems in Math. And then Wiles proved it in the 1990's. While the solution was a surprise when it happened it was not that much of a surprise.

QUESTION: Is P vs NP more similar to Greek3 or to FLT?

c) Peano Arithmetic (and similar systems) are incomplete. Hilbert's 2nd problem (1900) asked to show the axioms of PA were consistent. Godel (1931) showed this could not be done.  Moreover, there are TRUE statements about numbers that PA cannot prove. I think people mostly thought PA was complete so one of the innovations was to think it was incomplete.
MATHEMATICS WAS READY FOR SUCH PROBLEMS
but it took the boldness to think PA was incomplete to solve it.  The math needed was known when Hilbert posed the problem. But of course, how to put it together was still quite a challenge.

d) The Continuum Hypothesis, CH, is that there is no cardinality between N and R. Cantor in 1878 asked for a proof that CH was true. It was Hilbert's first problem in 1900.
When Hilbert posed this problem in 1900
MATHEMATICS WAS NOT QUITE  READY FOR SUCH PROBLEMS.
The math to solve it wasn't quite there, but wasn't so far off (of course, that's in hindsight). Godel's model L (1940) was brilliant, though Lowenhiem-Skolem had constructed models.  A model of set theory that was defined by levels was, I think, though of by Russell (though in a very diff way than L). When Cohen did a model where CH is false (1963) he invented forcing for Set Theory, though forcing had already been used in Recursion theory (The Kleene-Post construction of intermediary Turing degrees.)

e) Hilbert's tenth problem (1900): Find an algorithm that will, given a poly in many variables over Z, determine if it has a solution in Z.
MATHEMATICS WAS ALMOST READY FOR SUCH PROBLEMS.
I turns out that there is no such algorithm. Similar to CH: Once it was thought that it was unsolvable, the proof that it was unsolvable just took a few decades. However, it did need  the definition of computable to be pinned down.  Davis-Putnam-Robinson outlined what was needed in the 1950's,and Matiyasevich finished it in 1970.  While it required just the right combination of ideas, and lots of cleverness, the math needed was known.
CAVEAT: There are many restrictions of H10 that are still open. My favorite: is the following solvable: given k, does x^3 + y^3 + z^3 = k have a solution in Z? (See my blog post on this problem here.) For a survey of what is known about subcases see (1) my paper here, though it is has been argued that I am looking at the wrong subcases (see my blog post on this here), and (2) Bogdan Grechuk's paper here
CAVEAT: Matiyasevich has suggested that Hilbert really meant to ask about equations and solutions over  Q. That problem is still open. If it is unsolvable, that might be proven reasonably soon. If it is solvable, then
MATHEMATICS IS NOT READY FOR SUCH PROBLEMS.

f) The four color theorem. Posed in 1852 by Francis Guthrie, proven in 1976. Haken, Appel, and Koch (more on that last name later) did do some very impressive math to set the problem up, and the computer program to finish it off. When the problem was posed (1852) the computing power was not up to the task. So
COMPUTER SCIENCE WAS NOT READY FOR SUCH PROBLEMS.
Could the ideas to set it up have been done earlier? Maybe, but not much earlier. The result is often attributed to Haken and Appel, but actually there are two papers, and Koch is an author on the second one. Note that (1) Robertson, Sanders, Seymour, Thomas had a simpler, though still computer proof (1996), and (2) Werner Gonthier formalized the proof inside the Coq proof assistant in 2005.
CAVEAT: An open problem that is hard to state precisely is to come up with a non-computer proof.
CAVEAT: There is a non-computer proof that every planar graph is 4.5-colorable, see my blog post in this here. (No, this is not a joke. If it was I would make if funnier and claim there is a non-computer proof that every planar graph is 4 + 1/e colorable.)

g) Poincare Conjecture. Conjectured in 1904 and solved in 2002. To bad---if it was solved in 2004 it would be exactly 100 years. There was some progress on this all along so I don't know which step was the hard one though probably they were all hard. This one is harder for me to speculate on. When it was solved and Darling wanted to know why it was worth \$1,000,000 I told her that it says if something tastes and smells and feels like a sphere, its a sphere. She was unimpressed.  But back to our story:  in hindsight,
MATH WAS READY FOR SUCH PROBLEMS
since there was steady progress. I think of NOT READY as meaning NO progress, NO plan.

h) The Erdos Distance Problem: Show that for any n points in the plane the number of distinct distances is Omega(n/\sqrt{log n}). Not quite solved, but a big milestone was Gutz and Katz proof of Omega(n/log n). For that result
MATH WAS READY FOR SUCH PROBLEMS
Steady progress:  see the Wikipedia entry here. What's of interest to us is that there was a barrier result of Omega(n^{8/9}) by Ruzsa (apparently unpublished) that said the techniques being used could not do better-- so people, in short order, found new techniques.  Here is hoping that happens with P vs NP.

--------------------------------------------------------------------------------
Let's look at problems that are open and unsolved.

a) Collatz Conjecture (also called the 3x+1 conjecture). I asked
Jeff Lagarias, who is an expert on the problem:

Is it true? When will it be resolved? He said Yes and Never.

I once heard there has been NO progress on this problem, though I later  heard that Terry Tao has made some progress. In any case, not much progress has been made. Maybe Erdos was right.

QUESTION: Why does my spell checker think that Collatz is not a word?

b) Small Ramsey Numbers. I asked Stanislaw Radziszowski, who is an expert on Small Ramsey Numbers (he has a dynamic survey on small Ramsey numbers here

What is R(5)?  When will we know? He said 43 and Never.

Worse than being hard, I don't think any nice math has come out of trying to find R(5,5). Too bad. The coloring that gives the lower bound for R(4) and some (perhaps all) of the R(i,j) where i,j\le 4 can be derived from group theory. YEAH! But then connections to interesting math just... stopped. For now? Forever? Joel Spencer told me this is an example of the law of small numbers: patterns that hold for small numbers stop holding when the numbers get too big. (I've seen other things  called the law of small numbers as well.)
MATH MAY NEVER BE READY FOR SUCH PROBLEMS
If no interesting math comes out of the attempt to find the exact values of the Ramsey Numbers, then it is not a good problem.

Note:  The conversations about Collatz and R(5) were within 10 minutes of each other. Depressing day!

c) The Twin Primes Conjecture. Sieve methods have been used to get partial result. YEAH! Yitang Zhang showed there exists infinite x such that x and x + 70million (something like that are prime. YEAH. Its been gotten down to x, x+246 and with various assumptions x,x+12 or x, x+6). YEAH! but Sieve methods are known to NOT be able to prove the  conjecture. Dang it!
DO NOT KNOW IF MATH IS READY FOR SUCH PROBLEMS.
I think people are kind of stuck here. Much like P vs NP, though at least they have some partial results. By contrast, with regard to P vs NP we don't even have that (unless you count Ryan's lower bound on SAT---maybe you do).

Note: I found that information here which seems to be an Encyclopedia Britannica  website. I would have thought that, with the web and Wikipedia, they would be out of business. Good for them to still be in business!

d) I am not qualified to write about any of the Millennium prizes except P vs NP (am I even qualified for that?)  so I ask my readers to leave opinions (informed or not) about, for which of them,
MATH IS NOT READY FOR SUCH PROBLEMS
One of the people who worked on the Riemann Hypothesis said:

I do not recommend spending half your life on the Riemann Hypothesis.

That raises a different question: When do you give up? (topic for a different blog post).

e) I am also not qualified to write about the Hilbert Problems where are still unsolved. Note that some of them are not well enough defined  to ever be resolved (H6: Make Physics rigorous) and some are either solved or unsolved depending on who you ask (H4: Construct ALL metrics where lines are geodesics-- surely, he didn't mean ALL metrics. Probably right, but  stop calling me Shirley!) For a byte more about Hilbert's problems, including a few paragraphs on H4,  see my reviews of two books on them, here. Same as the last item- if you have an opinion (informed or not) about, for which of them that are though to be sort-of open, is math ready for them, leave a comment.

CODA: Alexandra will be working on Collatz this summer!
Let's wish her luck!

by gasarch (noreply@blogger.com) at October 17, 2021 07:59 PM UTC

### TR21-144 | Towards Uniform Certification in QBF | Leroy Chew, Friedrich Slivovsky

from ECCC papers

We pioneer a new technique that allows us to prove a multitude of previously open simulations in QBF proof complexity. In particular, we show that extended QBF Frege p-simulates clausal proof systems such as IR-Calculus, IRM-Calculus, Long-Distance Q-Resolution, and Merge Resolution. These results are obtained by taking a technique of Beyersdorff et al. (JACM 2020) that turns strategy extraction into simulation and combining it with new local strategy extraction arguments. This approach leads to simulations that are carried out mainly in propositional logic, with minimal use of the QBF rules. Our proofs therefore provide a new, largely propositional interpretation of the simulated systems. We argue that these results strengthen the case for uniform certification in QBF solving, since many QBF proof systems now fall into place underneath extended QBF Frege.

### Combining Diverse Feature Priors

In our new paper, we take a closer look at the design space of so-called feature priors—i.e., priors that bias the features that a model relies on. We show that models trained with diverse sets of such priors have less overlapping failure modes and can correct each other’s mistakes.

At the core of deep learning’s success is its ability to automatically learn useful features from raw data, e.g., image pixels. Consequently, the set of such learned features has a large impact on the deep learning model’s ability to generalize, especially when the deployment conditions deviate from the training environment. For example, a model that relies heavily on the background of an image may have trouble recognizing a cow on a beach if during training it was mostly encountering cows on grass.

So, how can one control the features that a model relies on? This is often accomplished by changing the model’s architecture or training methodology. For example, by choosing to use a convolutional neural network a model designer biases that model towards learning a hierarchy of spatial features. Similarly, by employing data augmentation during training, one biases the model towards features that are invariant to the particular augmentations used. In fact, researchers have recently explored explicitly manipulating the set of features that a model learns, via, for example, suppressing texture information by training on stylized inputs or by training with worst-case input perturbations to avoid brittle features.

Now, all of these design choices can be thought of as imposing feature priors, i.e., biasing a model towards learning a particular type of features. The question thus becomes: how can we explore and leverage this space of feature priors in a more systematic and purposeful manner?

## Feature Priors as Distinct Perspectives on Data

The core idea is to view different feature priors as distinct perspectives on the data. That is, since different sets of features are likely to generalize differently across inputs, by considering multiple such sets in tandem we can obtain a more holistic (and thus, hopefully, reliable) view on our data.

Setting up our case study: Let us focus our attention on two concrete feature priors: shape and texture. These two priors arise naturally in the context of image classification and will serve as the basis of our study. We impose these priors through deliberate construction of the dataset and architecture:

• Shape-based priors: We remove texture information from the images with the help of an edge detection algorithm. We use for this purpose two canonical edge detection algorithms from the computer vision literature: Canny and Sobel.
• Texture-based priors: We use a variant of the BagNet model, which limits the model’s receptive field to prevent the model from relying on global structures like shape.
Original
Sobel
Canny
BagNet
Visualizing different feature priors. Sobel and Canny suppress texture through edge detection while BagNet suppresses shape by limiting the receptive field.

Intuitively, these priors should correspond to very different sets of features. But are the views offered by these priors truly complementary? A simple way to measure this is to quantify the overlap of the failure modes of the models trained with the respective priors. Specifically, we measure the correlation of predicting the correct label for each pair of such models. We perform this analysis on a subset of CIFAR-10.

Correlation of predictions between pairs of models with different priors on a subset of CIFAR-10. The shape-biased and texture-biased models have the least correlated predictions.

It looks like these results match our intuition! Indeed, models corresponding to the same feature priors (but different random initialization) are relatively well correlated with each other. This also includes the case when we use two different shape biases. On the other hand, when we consider a pairing of a shape-biased model and a texture-biased model the predictions are the least correlated, that is, they make different mistakes on the test set.

## Combining feature priors

Since shape- and texture-biased models make different types of mistakes, can we leverage their diversity to improve our predictions?

### Ensembles

A natural way to examine that question is combining these models in an ensemble. This ensemble, for a given test input, evaluates both models on that input and then outputs the one prediction that is assigned the highest probability by the respective model. It turns out that, indeed, such an ensemble is significantly more accurate when we combine in this way models with different priors (as opposed to combining two models trained with the same prior, but with different initializations). Clearly, prediction diversity matters!

The maximum accuracy achieved when using a single model, an ensemble with two models with the same prior, and an ensemble with two models with different priors on the CIFAR-10 subset.

### Co-Training

So far, we demonstrated that models with diverse feature priors have less overlapping failure modes, and can be combined via ensembling for improved generalization performance. However, is that all? In particular, are there ways of incorporating prior diversity during training (as opposed to ensembling post hoc) in order to improve the learning process itself?

To answer this question, we focus on self-training, a methodology often employed when the labeled data is insufficient to learn a well-generalizing model alone, but a large pool of unlabeled data is available. The key idea in self-training is to use a model trained on the labeled data to produce “pseudo-labels” for the unlabeled data and then use these labels for further training. This setting is particularly challenging since: (a) the (original) labelled data points are typically too few to learn reliable prediction rules, and (b) any incorrect prediction rules learned will be reinforced via pseudo-labelling (so-called “confirmation bias”).

From this perspective, our goal is to jointly train models with different feature priors to mitigate the propagation of such incorrect prediction rules. We will do so by leveraging the well-established framework of co-training, a framework designed for learning from data with multiple independent sets of features. In the context of our study, we can instantiate this framework as follows.

First, we train one model for each prior using the labeled data. Then, we use each of these models to pseudo-label the unlabelled data and add the examples which are assigned the highest predicted probability to a joint pseudo-labelled data pool. We then use these examples to train both models further and keep repeating this process until we eventually use all the unlabeled data for training. In the end, we combine these models into a single classifier by training a standard model from scratch on the combined pseudo-labels.

The intuition behind this process is that by jointly training models which rely on different features, these models can learn from each other’s predictions. If one model produces incorrect pseudo-labels, we can hope that the other model will correct them by relying on alternative prediction rules.

So, how well does this approach work? To evaluate it, we extract from the CIFAR-10 dataset a small, labeled part (100 examples per class) and treat the rest of this dataset unlabeled. We then compare how different training methods—specifically, self-training a model with a single prior, and co-training models with different priors—perform in this setting. (For an additional baseline, we also consider ensembling two such models with different priors together.)

Test accuracy of models in pre-trained, self-trained, and co-trained settings. We consider: a single model alone, combining multiple models with the same prior, and combining models with diverse priors.

Overall, we find that co-training with shape- and texture-based priors can significantly improve the test accuracy of the final model compared to self-training with any of the priors alone. In fact, co-training models with diverse priors also improves significantly upon simply combining self-trained models in an ensemble. So these models are indeed able to take advantage of each other’s predictions during training.

## Priors and Spurious Correlations

So far, we were focused on a setting where the training and test data were all sourced from the same distribution. However, a major challenge for the real-world model deployment are spurious correlations: associations which are predictive on the training data but not valid for the actual task. For example, an image classification model may predict an object’s class based on its location, or rely on artifacts of the data collection process.

How can we train models that avoid picking up such spurious correlations? For this problem to be tractable in the first place, we need to rely on some information beyond the training data. Here, we will assume that we have access to an unlabelled dataset where this correlation does not hold (e.g., cows do not always appear on grass and thus the correlation “grass”->”cow” is not always predictive). This is a rather mild assumption in settings where we can easily collect unlabelled data from a variety of sources—if a correlation is spurious, it is less likely to be uniformly present.

As a concrete example, let us consider a simple gender classification task based on the CelebA dataset. In this task, we will introduce a spurious correlation into the labeled data by only collecting photographs of blond females and non-blond males. This makes hair color a good predictor of gender for the labeled dataset, but will not generalize well beyond that as such correlation does not hold in the real world.

Our goal here will be to harness an unbiased, yet unlabelled dataset, to learn a model that avoids this correlation. We will again attempt to do so by co-training models with diverse feature priors: shape and texture. Notice that since the spurious correlation is color-based, shape-biased models are likely to ignore it. As a result, we anticipate that the prediction of the shape-biased and texture-biased models will differ on inputs where hair color disagrees with gender. Thus, during co-training, these models are intuitively providing each other with counter-examples and are thus likely to steer each other away from incorrect prediction rules.

Accuracy of models with different feature priors on the (unbiased) CelebA test set. We consider the setting of using only the (biased) labeled data, as well as self-training and co-training using the (unbiased) unlabeled dataset. (For co-training, the combined model is a standard model trained on the combined pseudo-labels of the co-trained Canny and BagNet models.)

We find that this is indeed the case! When we co-train a texture-biased model with a shape-biased one, the texture-biased model improves substantially, relying less on the hair color. Moreover, the shape-biased model also improves through co-training. This indicates that even though the texture-biased model relies heavily on the spurious correlation, it also captures non-spurious associations that, through pseudo-labeling, are useful for the shape-based model too.

## Outlook: Exploring the Design Space of Priors

In this post, we described how models trained with diverse feature priors can be leveraged during training to learn more reliable prediction rules (e.g., in the presence of spurious correlations). However, we view our exploration as only the first step in systematically exploring the design space of feature priors. We believe that this direction will yield an important building block of reliable training and deployment pipelines.

### The Ideal State Machine Model: Multiple Clients and Linearizability

We introduced state machines and state machine replication in an earlier post. In this post, we elaborate on the exact notion of safety and liveness that can be obtained by an ideal state machine when there are multiple clients. First, these definitions highlight the challenges of serving multiple clients. Second,...

from David Eppstein

• I was interested to see a familiar-looking graph drawing as one of the answers to the prompt “multiplicity” for the first entry in Mathober 2021 ($\mathbb{M}$). It’s a multigraph formed by a triangle with tripled edges, and looks a lot like the drawing I made for the Wikipedia Shannon multigraph article, prettied up by making an infinitely recessing sequence of these drawings rather than just one. Good choice for multiplicity.

• Non-concentration of the chromatic number of random graphs ($\mathbb{M}$). Uniformly random graphs, $G(n,1/2)$ in the Erdős–Rényi–Gilbert model, turn out to have chromatic numbers that, for infinitely many $n$, are spread over roughly $\sqrt{n}$ values. But there are weird fluctuations so that, conjecturally, for some $n$ the concentration is much tighter, more like $n^{1/4}$.

• University System of Georgia moves to gut tenure ($\mathbb{M}$). The proposed new policy includes procedures for removal of tenure under certain enumerated grounds, including failure to perform their jobs (this is pretty normal) but then adds a massive escape clause in which the board of regents can remove tenured faculty at any time as long as their reason for doing so is not one of the enumerated ones.

• The first physical models of the hyperbolic plane, made in 1868 by Beltrami ($\mathbb{M}$, via), blog post by Daina Taimiņa from 2010. Maybe you could make something like this by wrapping and stretching a disk of wet paper in a roll around a pseudosphere (https://en.wikipedia.org/wiki/Pseudosphere)? The rolled-up photo of Beltrami’s model suggests that he did that. The via link shows this as a tangent to a story about triangulated polygons, frieze patterns, and the Farey tessellation.

• Why do bees make rhombic dodecahedrons ($\mathbb{M}$)? Nice video from Matt Parker (Stand-up Maths) on why bees usually end the cells of their honeycombs with rhombic dodecahedron faces, why this isn’t the optimal solution to fitting two layers of cells together (in terms of minimum wax usage), and why it isn’t reasonable to expect bees to find exact optima for this problem. If I have to quibble with something, though, it’s his plural. It’s not wrong, but see Google ngrams.

• Mathematicians prove melting ice stays smooth ($\mathbb{M}$, see also). The headline is a little overstated: you’re probably familiar with thin necks of ice melting to sharp points at the instant they separate. But these singularities are instantaneous: mathematical models of ice stay smooth for all but a measure-zero set of times. Original paper: “The singular set in the Stefan problem”, Alessio Figalli, Xavier Ros-Oton, and Joaquim Serra.

• Discussion of the recent meme telling programmers and mathematicians that summation notation and for loops are the same thing. They’re not quite the same, though: summations don’t have an order of evaluation. But which is easier for people who don’t already know what they are to search and find out about? And why do programmers get angry at non-programming notational conventions?

• Mathematics, morally ($\mathbb{M}$, via), Eugenia Cheng, 2004. Somehow I hadn’t run across this before. It argues that much philosophy of mathematics is irrelevant to practice (“You can’t tell from somebody’s mathematics if they are a fictionalist, a rationalist, a platonist, a realist, an operationalist, a logicist, a formalist, structuralist, nominalist, intuitionist.”) and instead considerations of the right way of handling certain topics are more central.

• The SIGACT Committee for the Advancement of Theoretical Computer Science is collecting information on women in theoretical computer science ($\mathbb{M}$). If this is you, please see their announcement for details of how to be counted.

• Cynthia Rudin wins major award with silly name ($\mathbb{M}$), for her work on machine learning systems that learn to predict behavior using simple, interpretable, and transparent formulas.

• According to the SODA web site, SIAM has decided that their conferences will be hybrid through July ($\mathbb{M}$). So if (like me) you wanted to participate in SODA/SOSA/ALENEX/APOCS, but were worried about planning a trip to Virginia with another predicted winter wave of Covid, now you can stay home and conference safely. Or, if you feel hybrid conferences are problematic and organizers should do one or the other but not both, now you have another reason to be annoyed.

• Rarely is the question asked: Are math papers getting longer? ($\mathbb{M}$). Following earlier work by Nick Trefethen, Edward Dunne provides some data suggesting that (for certain journals, at least, and not the ones with page limits) the answer is yes. I’m not convinced by the suggested explanation that it’s because they are taking more effort to explain “connections with other work”, though: is that really a big enough fraction of most papers?

• I haven’t been using my office desktop Mac much because I haven’t been into my office much, so it took me a while to pay attention to the fact that much of its networking had recently broken. Here’s why ($\mathbb{M}$). It was still running OS X El Capitan (10.11.6) and a crucial top-level certificate expired. The machine is too old (late 2009) for the latest OS X but it looks like I can and should upgrade to High Sierra, 10.13. So much for getting anything else accomplished today…

by David Eppstein at October 15, 2021 03:43 PM UTC

### A Young Person's Game?

When László Babai first announced his graph isomorphism in quasipolynomial time result, I wrote

We think of theory as a young person's game, most of the big breakthroughs coming from researchers early in their careers. Babai is 65, having just won the Knuth Prize for his lifetime work on interactive proofs, group algorithms and communication complexity. Babai uses his extensive knowledge of combinatorics and group theory to get his algorithm. No young researcher could have had the knowledge base or maturity to be able to put the pieces together the way that Babai did.

Babai's proof is an exceptional story, but it is exceptional. Most CS theorists have done their best work early in their career. I got myself into a twitter discussion on the topic. For me, I'm proud of the research I did through my forties, but I'll always be best known, research wise, for my work on interactive proofs around 1990. It would be hard to run a scientific study to determine cause and effect but here are some reasons, based on my own experiences, on why we don't see research dominated by the senior people in theory.

The field changes - Computation complexity has moved from a computational-based discipline to one now dominated by combinatorics, algebra and analysis. I'm not complaining, a field should evolve over time but it plays less to my strengths. It's hard to teach this old dog new tricks.
The fruit hanged lower - there were important problems with easier proofs available then not available now
Responsibilities - You have fewer as a PhD student, postdoc or assistant professor.
Family - becomes more of a focus.
Taking on new jobs - Many academics, though not all, take on administrative roles at their university or , or leave academics completely.
The young people have the new ideas - And older people get settled in their ways
The thrill is gone or at least decays - Your first theorem, your first talk, your first conference paper gives you a level of excitement that's hard to match.
Existentialism - The realization that while computing has no doubt changed the world, my research, for the most part, hasn't.
Cognitive Decline - Probably the most controversial but for me I find it hard to focus on problems like I used to. Back in the day I prided myself on knowing all the proofs of my theorems, now I can't even remember the theorems.

Honestly there is just nothing wrong with taking on new roles, writing books, surveys and blogs, focusing on teaching and mentorship and service and leaving the great research to the next generation.

by Lance Fortnow (noreply@blogger.com) at October 15, 2021 01:32 PM UTC

### Nakamoto's Longest-Chain Wins Protocol

In this post, we will cover Nakamoto’s Consensus protocol presented in the Bitcoin whitepaper. There are a lot of posts and videos that explain this seminal and simple protocol. Our goal in this post will be to intuitively piece out the need for different aspects of the protocol, such as...