Authors: Laurent Bulteau, Konrad K. Dabrowski, Noleen Köhler, Sebastian Ordyniak, Daniël Paulusma
Download: PDF
Abstract: A homomorphism $f$ from a guest graph $G$ to a host graph $H$ is locally
bijective, injective or surjective if for every $u\in V(G)$, the restriction of
$f$ to the neighbourhood of $u$ is bijective, injective or surjective,
respectively. The corresponding decision problems, LBHOM, LIHOM and LSHOM, are
well studied both on general graphs and on special graph classes. Apart from
complexity results when the problems are parameterized by the treewidth and
maximum degree of the guest graph, the three problems still lack a thorough
study of their parameterized complexity. This paper fills this gap: we prove a
number of new FPT, W[1]-hard and para-NP-complete results by considering a
hierarchy of parameters of the guest graph $G$. For our FPT results, we do this
through the development of a new algorithmic framework that involves a general
ILP model. To illustrate the applicability of the new framework, we also use it
to prove FPT results for the Role Assignment problem, which originates from
social network theory and is closely related to locally surjective
homomorphisms.
Authors: Mario Szegedy, Jingjin Yu
Abstract: Given a set of terminals in 2D/3D, the network with the shortest total length
that connects all terminals is a Steiner tree. On the other hand, with enough
budget, every terminal can be connected to every other terminal via a straight
edge, yielding a complete graph over all terminals. In this work, we study a
generalization of Steiner trees asking what happens in between these two
extremes. Focusing on three terminals with equal pairwise path weights, we
characterize the full evolutionary pathway between the Steiner tree and the
complete graph, which contains intriguing intermediate structures.
Authors: Sun Yixuan, Zhu Zhehao
Abstract: This paper proposes a method to judge whether a point is inside or outside
of a simple convex polygon using the intersection of a vertical line. It
first locates the point within an area enclosed by two straight lines, converting
the problem of determining whether a point is inside or outside of a convex
polygon into the problem of determining whether a point is inside or outside of
a quadrilateral; it then applies the ray method. The complexity of
this algorithm ranges from O(1) to O(n). As the experimental results show, the algorithm
computes fewer intersections and greatly improves the efficiency of the judgment.
Authors: Debashis Mukherjee
Abstract: An incremental approach for computing the convex hull of data points in
two dimensions is presented. The algorithm is not output-sensitive and runs in
time linear in the number of input points. Graham's scan is
applied only to a subset of the data points, those located at the extremes of the
dataset. Points are classified as extremal in proportion to their modular (angular)
distance about an imaginary point interior to the region bounded by the convex
hull of the dataset, taken as the origin in polar coordinates. The subset
of the data is arrived at by iterating over exponentially decreasing
intervals and terminating when no change in the maximal points per bin is
observed.
Authors: Carlos Mougan, Jose M. Alvarez, Gourab K Patro, Salvatore Ruggieri, Steffen Staab
Abstract: Protected attributes are often presented as categorical features that need to
be encoded before feeding them into a machine learning algorithm. Encoding
these attributes is paramount as they determine the way the algorithm will
learn from the data. Categorical feature encoding has a direct impact on the
model performance and fairness. In this work, we compare the accuracy and
fairness implications of the two most well-known encoders: one-hot encoding and
target encoding. We distinguish between two types of induced bias that can
arise while using these encodings and can lead to unfair models. The first
type, irreducible bias, is due to direct group category discrimination and a
second type, reducible bias, is due to large variance in less statistically
represented groups. We take a deeper look into how regularization methods for
target encoding can mitigate the induced bias while encoding categorical
features. Furthermore, we tackle the problem of intersectional fairness that
arises when combining two protected categorical features, leading to higher
cardinality. This practice is a powerful feature engineering technique used for
boosting model performance. We study its implications on fairness, as it can
increase both types of induced bias.
Authors: Jianping Cai, Ximeng Liu, Jiayin Li, Shuangyue Zhang
Abstract: Studying hierarchical trees starting from their local structures is a common
research method. However, the cumbersome analysis and description make this
naive method hard to adapt to increasingly complex hierarchical tree
problems. To improve the efficiency of hierarchical tree research, we propose
an embeddable matrix representation for hierarchical trees, called the Generation
Matrix. It transforms an abstract hierarchical tree into a concrete matrix
representation, allowing the hierarchical tree to be studied as a whole, which
dramatically reduces the complexity of the research. Mathematical analysis shows
that Generation Matrix can simulate various recursive algorithms without
accessing local structures and provides a variety of interpretable matrix
operations to support the research of hierarchical trees. Applying Generation
Matrix to differential privacy hierarchical tree release, we propose a
Generation Matrix-based optimally consistent release algorithm (GMC). It
provides an exceptionally concise process description so that we can describe
its core steps as a simple matrix expression rather than multiple complicated
recursive processes like existing algorithms. Our experiments show that GMC
takes only a few seconds to complete a release for large-scale datasets with
more than 10 million nodes. The calculation efficiency is increased by up to
100 times compared with the state-of-the-art schemes.
Authors: Itay Levinas, Roy Scherz, Yoram Louzoun
Abstract: Estimating the frequency of sub-graphs is of importance for many tasks,
including sub-graph isomorphism, kernel-based anomaly detection, and network
structure analysis. While multiple algorithms have been proposed for full
enumeration or sampling-based estimation, these methods fail on very large
graphs. Recent advances in parallelization allow for estimates of total
sub-graphs counts in very large graphs. The task of counting the frequency of
each sub-graph associated with each vertex also received excellent solutions
for undirected graphs. However, there is currently no good solution for very
large directed graphs.
We here propose VDMC (Vertex-specific Distributed Motif Counting) -- a fully distributed algorithm to optimally count all connected directed 3- and 4-vertex sub-graphs (motifs) associated with each vertex of a graph. VDMC counts each motif only once, and its complexity is linear in the number of counted motifs. It is fully parallelized to be efficient in GPU-based computation. VDMC is based on three main elements: 1) ordering the vertices and only counting motifs containing vertices in increasing order; 2) sub-ordering motifs based on the average length of the BFS composing the motif; and 3) removing isomorphisms only once for the entire graph. We compare VDMC to analytical estimates of the expected number of motifs and show its accuracy. VDMC is available as highly efficient CPU and GPU code with a novel data structure for efficient graph manipulation. We show the efficacy of VDMC on real-world graphs. VDMC allows for the precise analysis of sub-graph frequency around each vertex in large graphs and opens the way for extending methods until now limited to graphs of thousands of edges to graphs with millions of edges and above.
GIT: https://github.com/louzounlab/graph-measures
Authors: Ming Ding, Rasmus Kyng, Peng Zhang
Abstract: We give a nearly-linear time reduction that encodes any linear program as a
2-commodity flow problem with only a small blow-up in size. Under mild
assumptions similar to those employed by modern fast solvers for linear
programs, our reduction causes only a polylogarithmic multiplicative increase
in the size of the program and runs in nearly-linear time. Our reduction
applies to high-accuracy approximation algorithms and exact algorithms. Given
an approximate solution to the 2-commodity flow problem, we can extract a
solution to the linear program in linear time with only a polynomial factor
increase in the error. This implies that any algorithm that solves the
2-commodity flow problem can solve linear programs in essentially the same
time. Given a directed graph with edge capacities and two source-sink pairs,
the goal of the 2-commodity flow problem is to maximize the sum of the flows
routed between the two source-sink pairs subject to edge capacities and flow
conservation. A 2-commodity flow can be directly written as a linear program,
and thus we establish a nearly-tight equivalence between these two classes of
problems.
Our proof follows the outline of Itai's polynomial-time reduction of a linear program to a 2-commodity flow problem (JACM'78). Itai's reduction shows that exactly solving 2-commodity flow and exactly solving linear programming are polynomial-time equivalent. We improve Itai's reduction to nearly preserve the problem representation size in each step. In addition, we establish an error bound for approximately solving each intermediate problem in the reduction, and show that the accumulated error is polynomially bounded. We remark that our reduction does not run in strongly polynomial time and that it is open whether 2-commodity flow and linear programming are equivalent in strongly polynomial time.
Authors: Philip Bille, Inge Li Gørtz, Tord Stordalen
Abstract: We consider the predecessor problem on the ultra-wide word RAM model of
computation, which extends the word RAM model with 'ultrawords' consisting of
$w^2$ bits [TAMC, 2015]. The model supports arithmetic and boolean operations
on ultrawords, in addition to 'scattered' memory operations that access or
modify $w$ (potentially non-contiguous) memory addresses simultaneously. The
ultra-wide word RAM model captures (and idealizes) modern vector processor
architectures.
Our main result is a simple, linear space data structure that supports predecessor in constant time and updates in amortized, expected constant time. This improves the space of the previous constant time solution that uses space in the order of the size of the universe.
Our result is based on a new implementation of the classic $x$-fast trie data structure of Willard [Inform.~Process.~Lett. 17(2), 1983] combined with a new dictionary data structure that supports fast parallel lookups.
Authors: Ting-Chun Lin, Min-Hsiu Hsieh
Abstract: A locally testable code (LTC) is an error correcting code with a property
tester. The tester checks whether a word is a codeword by reading a constant
number of random bits, and rejects the word with probability proportional to the
distance from the word to the closest codeword. An important open question until
recently was whether there exist $c^3$-LTCs, which are LTCs with constant rate, constant
relative distance and constant locality. In this work, we construct a new LTC
family using 1-sided lossless expanders and balanced products.
Authors: Xu T. Liu, Jesun Firoz, Sinan Aksoy, Ilya Amburg, Andrew Lumsdaine, Cliff Joslyn, Assefaw H. Gebremedhin, Brenda Praggastis
Abstract: Hypergraphs offer flexible and robust data representations for many
applications, but methods that work directly on hypergraphs are not readily
available and tend to be prohibitively expensive. Much of the current analysis
of hypergraphs relies on first performing a graph expansion -- either based on
the nodes (clique expansion), or on the edges (line graph) -- and then running
standard graph analytics on the resulting representative graph. However, this
approach suffers from massive space complexity and high computational cost with
increasing hypergraph size. Here, we present efficient, parallel algorithms to
accelerate and reduce the memory footprint of higher-order graph expansions of
hypergraphs. Our results focus on the edge-based $s$-line graph expansion, but
the methods we develop work for higher-order clique expansions as well. To the
best of our knowledge, ours is the first framework to enable hypergraph
spectral analysis of a large dataset on a single shared-memory machine. Our
methods enable analyses of datasets from many domains that previous
graph-expansion-based models were unable to provide. The proposed $s$-line graph
computation algorithms are orders of magnitude faster than state-of-the-art
sparse general matrix-matrix multiplication methods, and obtain approximately
$5-31{\times}$ speedup over a prior state-of-the-art heuristic-based algorithm
for $s$-line graph computation.
The UCI mathematics department had a departmental colloquium today given by Noga Alon, titled “The Polynomial Method and its Algorithmic Aspects”. One part of his talk that I found very interesting was a collection of easy-to-state combinatorial problems where the existence of a solution can be proved, but there is no known efficient (polynomial time) algorithm for finding the solution:
Let \(p\) be a prime number, and suppose that we are given two inputs: a sequence \(A=a_1,a_2,\dots a_k\) of elements of \(\mathbb{Z}_p\), of length \(k\lt p\), not necessarily distinct from each other, and a set \(B\), also consisting of \(k\) elements of \(\mathbb{Z}_p\), distinct but not ordered. Can we pair them up by assigning an ordering \(b_1,b_2,\dots\) to the elements of \(B\) so that all of the sums \(a_i+b_i\) are distinct? For instance, if all of the \(A\)’s are equal, then any ordering of the \(B\)’s will work.
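For small instances, the existence claim in this first problem can be checked by brute force; the sketch below (function name is ours) simply tries every ordering of \(B\), which takes exponential time and so says nothing about the open algorithmic question:

```python
from itertools import permutations

def distinct_sum_ordering(A, B, p):
    """Brute-force search for an ordering b_1..b_k of B such that all
    sums a_i + b_i are distinct mod p. Exponential time; illustrates
    the existence claim, not an efficient algorithm."""
    k = len(A)
    for order in permutations(B):
        sums = {(a + b) % p for a, b in zip(A, order)}
        if len(sums) == k:
            return list(order)
    return None  # never reached when k < p, by Alon's theorem

# Example in Z_7 with k = 3 < p = 7: some valid ordering always exists.
print(distinct_sum_ordering([1, 1, 2], [0, 3, 5], 7))
```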
Let \(G\) be a bipartite graph that has a perfect matching. Assign “non-degrees” to the vertices on one side of the bipartition, numbers that should not be the degree of the vertex. Can we delete some subset of the vertices on the other side, so that in the remaining graph each vertex’s degree is different from its non-degree? For instance, if the non-degrees are all zero, then \(G\) itself will work (every vertex has nonzero degree because \(G\) has a perfect matching). If the non-degrees are all non-zero, then we can delete all of the vertices on the other side of the bipartition.
Let \(G\) be a planar graph with degree three at every vertex, and suppose also that each edge of \(G\) has a list of three colors available to it. Can we assign colors from these lists to the edges so that each vertex touches edges of three different colors?
Let \(C_1\) and \(C_2\) be proper colorings of a large \(d\)-dimensional grid, using \(q\ge d+2\) colors, and let \(S_1\) and \(S_2\) be subsets of the vertices of the grid that are far apart from each other (I am not sure how far this needs to be but it seems to be \(d\pm O(1)\) or maybe \(q\pm O(1)\)). Can we find a proper coloring of the whole grid that blends between the two colorings, agreeing with each \(C_i\) on \(S_i\)? This one comes from a new paper by Alon, Raimundo Briceño, Nishant Chandgotia, Alexander Magazinov, and Yinon Spinka, “Mixing properties of colourings of the \(\mathbb{Z}^d\) lattice”, Combinatorics, Probability and Computing 2021. Fewer colors will not always work.
In all of these cases, the answer to the existence problem is yes, but we don’t know of an efficient algorithm for finding the structure that is supposed to exist. Instead, Alon proves the existence non-constructively, through a combinatorial Nullstellensatz encoding the principle that low-degree polynomials don’t have many roots. For one-variable polynomials of degree \(d\), for instance, there cannot be more than \(d\) values at which the polynomial is zero. But the degree really has to be \(d\), and not merely at most \(d\), because the degree-zero constant-zero polynomial is zero everywhere.
The analogous principle Alon uses for higher numbers of variables is the following: Let \(p\) be a polynomial of degree \(d\) in the variables \(x_1,x_2,\dots x_n\), over some field, and let \(t_i\) be the exponents of a nonzero monomial \(\prod_i x_i^{t_i}\) of degree \(d\) in \(p\). Additionally, let \(S_i\) be (not necessarily distinct) sets of \(t_i+1\) elements. Then we can choose an element \(s_i\) in each \(S_i\) such that \(p(s_1,s_2,\dots)\) is non-zero. Alon surveys this principle and its applications in his paper “Combinatorial Nullstellensatz”, Combinatorics, Probability and Computing 1999.
The hard parts about using this method to prove existence in combinatorial problems such as the ones above are finding the right polynomial and proving that it has a nonzero coefficient at the right monomial. The problem on reordering with distinct pairwise sums, for instance, comes from Alon’s paper “Additive Latin transversals”, Israel J. Math. 2000 and uses the polynomial
\[\prod_{i\lt j}(x_i-x_j)(a_i+x_i-a_j-x_j)\]where each \(S_i=B\), the first terms in the product can only be nonzero if each \(x_i\) is chosen to be a distinct member of \(B\), and the second terms in the product can only be nonzero if the pairwise sums are different from each other. The coefficient of the monomial \(\prod x_i^{k-1}\) turns out to be (up to sign) exactly \(k!\), which is nonzero modulo \(p\) by the assumptions that \(p\) is prime and \(k\lt p\). The nonzero coefficients for the other two problems are not so easy to compute: they are the permanent (number of perfect matchings) and the number of 3-edge-colorings of the given graphs. For the case of 3-edge-colorings, the fact that this number is nonzero is one of the equivalent forms of the 4-color theorem.
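As a sanity check on the claim that this coefficient is \(\pm k!\), one can expand the polynomial symbolically for a tiny \(k\). The sketch below uses a dictionary-based multivariate polynomial representation (the helper names are ours, not from the cited papers):

```python
from itertools import combinations

def poly_mul(P, Q):
    """Multiply polynomials stored as {exponent-tuple: coefficient}."""
    R = {}
    for e1, c1 in P.items():
        for e2, c2 in Q.items():
            e = tuple(a + b for a, b in zip(e1, e2))
            R[e] = R.get(e, 0) + c1 * c2
    return R

def linear(k, i, j, const):
    """The linear polynomial x_i - x_j + const in k variables."""
    ei = tuple(1 if t == i else 0 for t in range(k))
    ej = tuple(1 if t == j else 0 for t in range(k))
    P = {ei: 1, ej: -1}
    if const:
        zero = (0,) * k
        P[zero] = P.get(zero, 0) + const
    return P

def alon_coefficient(a):
    """Coefficient of prod_i x_i^(k-1) in
    prod_{i<j} (x_i - x_j)(a_i + x_i - a_j - x_j)."""
    k = len(a)
    P = {(0,) * k: 1}
    for i, j in combinations(range(k), 2):
        P = poly_mul(P, linear(k, i, j, 0))
        P = poly_mul(P, linear(k, i, j, a[i] - a[j]))
    return P.get((k - 1,) * k, 0)

print(alon_coefficient([2, 5, 1]))  # magnitude is 3! = 6, independent of a
```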
The general problem of constructing a nonzero solution to an instance of the combinatorial Nullstellensatz whose polynomial is described by an arithmetic circuit is (according to Alon) as hard as inverting arbitrary one-way permutations, a standard cryptographic primitive. This suggests that a polynomial-time construction algorithm does not exist; if it did exist, it would break a lot of modern cryptography. On the other hand, proving that it does not exist would also prove \(\mathsf{P}\ne\mathsf{NP}\). Despite this relation to standard hard problems, there’s no strong reason for believing that any of the combinatorial problems outlined above is as hard as the general case, so they might still have efficient algorithms. If so, it would probably translate into an existence proof that is substantially different from the ones we already have for these problems.
In the last blog post, we covered the potential pitfalls of synthetic data without formal privacy guarantees, and motivated the need for differentially private synthetic data mechanisms. In this blog post, we will describe the select-measure-generate paradigm, which is a simple and effective template for designing synthetic data mechanisms. The three steps underlying the select-measure-generate paradigm are illustrated and explained below.
Mechanisms in this class differ primarily in their methodology for selecting queries and their algorithm for generating synthetic data from noisy measurements. The focus of this blog post is the final Generate step. Specifically, we will explore different ways in which one can model data distributions for the purpose of generating synthetic data, outlining the qualitative pros and cons of each method. We will then introduce the Marginal-Based Inference (MBI) repository, which provides methods that, given some set of noisy measurements, enable users to generate synthetic data in a generic and scalable way.
Separating the Generate subroutine from existing synthetic data generation mechanisms greatly simplifies the design space of new differentially private mechanisms. It allows the mechanism designer to focus on selecting the queries to maximize the utility of the synthetic data, rather than on how to generate synthetic data that explains the noisy measurements well. Both are challenging technical problems that require different techniques to solve; MBI provides principled solutions to the latter problem, while exposing an interface that can be readily adopted by mechanism designers.
In this section we will introduce the main optimization problem that underlies several methods for the Generate subproblem, and provide a high-level overview of how each method attempts to solve this optimization problem. Let \( y = \mathcal{M}(D) \) be the noisy measurements obtained from running a privacy mechanism on a discrete dataset \( D \). Our goal is to post-process these noisy measurements to obtain synthetic data that explains them well. In particular, we wish to search over the space of all datasets for one that maximizes the likelihood of the observations \( y \).^{2}
\[ \hat{D} \in \text{arg} \max_{D \in \mathcal{D}} \log \mathbb{P}[\mathcal{M}(D) = y] \]
This is a high-dimensional discrete optimization problem, and is generally intractable to solve in practice, even in low-dimensional settings. It is common to consider the relaxed problem that instead optimizes over the set of probability distributions \( \mathcal{S} \):
\[ \hat{P} \in \text{arg} \max_{P \in \mathcal{S}} \log \mathbb{P}[\mathcal{M}(P) = y] \label{eq1} \tag{1} \]
More generally, we can consider any objective function that measures how well \( P \) explains \( y \). The log-likelihood is a natural choice, although other choices are also possible and used in practice. In the special-but-common case where the mechanism is an instance of the Gaussian mechanism, we have \( \mathcal{M}(D) = f(D) + \mathcal{N}(0, \sigma^2)^k \) and \( \log \mathbb{P}[\mathcal{M}(P) = y] \propto - || f(P) - y ||_2^2 \). If \( f \) is a linear function of \( P \), then Problem \ref{eq1} is simply a quadratic program. In the subsequent subsections, we will describe different approaches to solve or approximately solve Problem \ref{eq1}.
Remark 1: The distribution learned from solving Problem \ref{eq1} will resemble the true data with respect to the statistics measured by \( \mathcal{M} \). It may or may not accurately preserve other statistics — that is data dependent.
Remark 2: The most common statistics to measure are low-dimensional marginals. A marginal for a subset of attributes counts the number of records in the dataset that match each setting of possible values. They are appealing statistics to measure because:
- They capture low-dimensional structure common in real world data distributions.
- Each cell in a marginal is a count, a statistic that is fairly robust to noise.
- One individual can only contribute to a single cell of a marginal, so all cells have low sensitivity and can be measured simultaneously with low privacy cost.
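The points above can be made concrete with a tiny example (the attribute names and values are made up for illustration):

```python
import numpy as np

# Five records over two attributes: sex in {0,1}, education in {0,1,2}.
sex = np.array([0, 1, 1, 0, 1])
edu = np.array([2, 0, 1, 2, 2])

# The (sex, education) marginal is a 2x3 table of counts.
marg = np.zeros((2, 3))
np.add.at(marg, (sex, edu), 1)
print(marg)

# Each record lands in exactly one cell, so adding or removing one
# individual changes a single count by one: every cell has sensitivity 1,
# and the whole marginal can be measured at low privacy cost.
```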
We can attempt to solve Problem \ref{eq1} directly by utilizing any algorithm for convex optimization over the probability simplex, such as multiplicative weights. This method works well in low-dimensional regimes, but quickly becomes intractable for higher-dimensional domains, where it is generally intractable even to enumerate all the entries of a single distribution \( P \), let alone optimize over the space of all distributions.
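On a domain small enough to enumerate, the direct method can be sketched in a few lines with multiplicative weights (exponentiated gradient descent); the measurement matrix, step size, and iteration count below are illustrative choices of ours:

```python
import numpy as np

def direct_mw(A, y, iters=500, eta=1.0):
    """Minimize ||A P - y||^2 over the probability simplex via
    multiplicative weights. A maps the full distribution P to the
    measured statistics; eta is the step size."""
    n = A.shape[1]
    P = np.full(n, 1.0 / n)               # start from the uniform distribution
    for _ in range(iters):
        grad = 2 * A.T @ (A @ P - y)      # gradient of the squared loss
        P = P * np.exp(-eta * grad)       # multiplicative update
        P /= P.sum()                      # re-project onto the simplex
    return P

# Toy domain of 4 cells; three overlapping linear measurements.
A = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
y = np.array([0.7, 0.3, 0.5])             # (noisy) measured values
P_hat = direct_mw(A, y)
print(P_hat, np.linalg.norm(A @ P_hat - y))
```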
Until recently, variants of the direct method were the only general-purpose solutions available for this problem, and as a result, many mechanisms struggled to scale to high-dimensional domains. Recently, several methods have been proposed that attempt to overcome the curse of dimensionality inherent in the direct approach, which scale by imposing additional assumptions on the mechanism \( \mathcal{M} \) and/or by relaxing the optimization problem. A common theme is to restrict attention to a subset of joint distributions which have tractable representations. The sections below describe these more scalable methods, including the different (implicit) assumptions each method makes, as well as the consequences of those assumptions.
The first method we describe is PGM, which was a key component of the first-place solution in the 2018 NIST Differential Privacy Synthetic Data Competition and in both the first and second-place solutions in the follow-up Temporal Map Competition.
PGM scales by restricting attention to distributions that can be represented as a graphical model \( P_{\theta} \). The key observation of PGM is that when \( \mathcal{M} \) only depends on \( P \) through its low-dimensional marginals, then one of the optimizers of Problem \ref{eq1} is a graphical model with parameters \( \theta \). In this case, Problem \ref{eq1} is under-determined and typically has infinitely many solutions. It turns out that the solution found by PGM has maximum entropy among all solutions to the problem — a very natural way to break ties among equally good solutions. Remarkably, these facts are true for any dataset — they do not require the underlying data to be generated from a graphical model with the same structure [MMS21].
The parameter vector \( \theta \) is often much smaller than \( P \), and we can efficiently optimize it, bypassing the curse of dimensionality in this special case. The size of \( \theta \) and in turn the complexity of PGM depends on the mechanism \( \mathcal{M} \), and in the worst case is the same as the Direct method.^{3} However, in many common cases of practical interest, the complexity of PGM is exponentially better than that of Direct, in which case we can efficiently solve the optimization problem above, finding \( \theta \) and thus a tractable representation of \( \hat{P} \). The complexity ultimately depends on the size of the junction tree derived from the mechanism \( \mathcal{M} \), and understanding this relationship requires some expertise in graphical models. However, if we utilize this understanding to design \( \mathcal{M} \), we can avoid this worst-case behavior, as MST and PrivMRF do.
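PGM's estimation procedure is too involved for a short snippet, but the maximum-entropy principle it relies on can be illustrated with classic iterative proportional fitting on a two-attribute toy domain (our own simplification, not the PGM algorithm itself):

```python
import numpy as np

def ipf(row_marg, col_marg, iters=50):
    """Iterative proportional fitting: starting from uniform, repeatedly
    rescale a joint table to match the two target marginals. The fixed
    point is the maximum-entropy joint distribution consistent with them."""
    P = np.full((len(row_marg), len(col_marg)), 1.0)
    P /= P.sum()
    for _ in range(iters):
        P *= (row_marg / P.sum(axis=1))[:, None]   # match the row marginal
        P *= (col_marg / P.sum(axis=0))[None, :]   # match the column marginal
    return P

row = np.array([0.6, 0.4])
col = np.array([0.3, 0.7])
P = ipf(row, col)
# With no overlapping measurements, maximum entropy means independence:
print(np.allclose(P, np.outer(row, col)))  # True
```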
An alternative approach was proposed in the recent RAP paper. The key idea is to restrict attention to “pseudo-distributions” that can be represented in a relaxed tabular format. The format is similar to the one-hot encoding of a discrete dataset, although the entries need not be \( 0 \) or \( 1 \), which enables gradient-based optimization to be performed on the cells in this table. The number of rows is a tunable knob that can be set to trade off expressive capacity against computational efficiency. With a sufficiently large number of rows, the true minimizer of the original problem can be expressed in this way, but there is no guarantee that gradient-based optimization will converge to it, because this representation introduces non-convexity. Moreover, the search space of this method includes “spurious” distributions, so even the global optimum of the relaxed problem would not necessarily solve the original problem.^{4} Despite these drawbacks, this method appears to work well in practice.
Among the iterative methods introduced by [LVW21] is GEM (Generative networks with the exponential mechanism), an approach inspired by generative adversarial networks. They propose representing any dataset as a mixture of product distributions over attributes in the data domain. They implicitly encode such distributions using a generative neural network with a softmax layer. In concrete terms, given some Gaussian noise \( \mathbf{z} \sim \mathcal{N}(0, I) \), their Generate step outputs \( f_\theta(\mathbf{z}) \), where \( f \) is some feedforward neural network parametrized by \( \theta \). \( f_\theta(\mathbf{z}) \) represents a collection of marginal distributions for each individual attribute in the domain, which can be used to directly answer any k-way marginal query. Alternatively, one can sample directly from \( f_\theta(\mathbf{z}) \) if the goal is to generate synthetic tabular data.
Note that the size of \( \mathbf{z} \) can be arbitrarily large, meaning that this generative network approach can theoretically be scaled up to capture any distribution \( P \). Moreover, [LVW21] show that one can achieve strong performance in practical settings even when \( \mathbf{z} \) is small, allowing such generative network approaches to scale in terms of both computation and memory. However, as is commonly found in deep learning methods, this optimization problem is non-convex.
Finally, GUM and APPGM do not search over any space of distributions, but instead impose local consistency constraints on the noisy measurements. These methods relax Problem \ref{eq1} to optimize over the space of pseudo-marginals, rather than distributions. The pseudo-marginals are required to be internally consistent, but there is no guarantee that there is a distribution which realizes those pseudo-marginals. As a result, the solution found by these methods need not be feasible in Problem \ref{eq1}. Nevertheless, we can attempt to generate synthetic data using heuristics to translate these locally consistent pseudo-marginals into synthetic tabular data. This approach was used by team DPSyn in both NIST competitions.
A qualitative comparison between the discussed methods is given in the table below.^{5}
Remark 3: Among the alternatives discussed here, only Direct and PGM can be expected to solve Problem \ref{eq1}. The alternatives fail to solve Problem \ref{eq1} in general, either from non-convexity, or from introducing spurious distributions to the search space. This distinguishing feature of PGM comes at a cost: the complexity can be much higher than the alternatives, and in the worst-case, will not be feasible to run. In such cases, one of the approximations must be used instead.
| | Direct | PGM | Relaxed Tabular | Generative Networks | Local Consistency |
| --- | --- | --- | --- | --- | --- |
| Search space includes optimum | Yes | Yes | Yes | Yes | Yes |
| Search space excludes spurious distributions | Yes | Yes | No | Yes | No |
| Convexity preserving | Yes | Yes | No | No | Yes |
| Solves Problem \ref{eq1} | Yes | Yes | No | No | No |
| Factors influencing scalability | Size of Entire Domain | Size of Junction Tree | Size of Largest Marginal | Size of Largest Marginal | Size of Largest Marginal |
Now that we have introduced the techniques underlying the Generate step, we will show how to utilize the implementations in the MBI repository to develop end-to-end mechanisms for differentially private synthetic data.
The input to any method for Generate is a collection of noisy measurements. We show below how to prepare these measurements in a format compatible with the methods for Generate implemented in the MBI repository. The measurements are represented as a list, where each element of the list is a noisy marginal (represented as a numpy array), along with relevant metadata including the attributes in the marginal and the amount of noise used to answer it. In the code snippet below, the selected marginals are hard-coded, but in general this list can be modified to tailor the synthetic data towards a different set of marginals.
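A self-contained sketch of this preparation step using NumPy only; the attribute names, domain sizes, and the (noisy array, noise scale, attribute tuple) format are illustrative assumptions rather than MBI's exact schema:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a discrete dataset with Adult-like attribute names.
domain = {'sex': 2, 'income>50K': 2, 'education-num': 16, 'occupation': 15}
n = 10000
data = {a: rng.integers(0, size, size=n) for a, size in domain.items()}

def marginal(attrs):
    """Count the records matching each combination of values of attrs."""
    shape = [domain[a] for a in attrs]
    idx = np.ravel_multi_index([data[a] for a in attrs], shape)
    return np.bincount(idx, minlength=int(np.prod(shape))).reshape(shape).astype(float)

# Hard-coded selection of five marginals to measure.
cliques = [('sex',), ('income>50K',), ('sex', 'income>50K'),
           ('education-num', 'income>50K'), ('occupation',)]

sigma = 50.0
measurements = []
for cl in cliques:
    noisy = marginal(cl) + rng.normal(0.0, sigma, size=[domain[a] for a in cl])
    measurements.append((noisy, sigma, cl))
```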
The above code snippet is a 5-fold composition of Gaussian mechanisms with \( \sigma = 50 \), and hence the entire mechanism is \( \frac{5}{2 \sigma^2} = \frac{1}{1000} \)-zCDP.
Given measurements represented in the format above, we can readily generate synthetic data using one of several methods. For example, the code snippet below generates synthetic data that approximately matches the noisy measurements:
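A self-contained sketch of this step: the commented lines show the presumed MBI calls (interface assumed, not verified here), while the runnable part substitutes a much simpler product-distribution sampler for the real inference engines:

```python
import numpy as np

# With MBI installed, the call would look roughly like (API assumed):
#   from mbi import FactoredInference
#   engine = FactoredInference(domain)
#   model = engine.estimate(measurements)
#   synth = model.synthetic_data()

rng = np.random.default_rng(0)

def generate(measurements, n):
    """Toy stand-in for an inference engine: fit each measured one-way
    marginal independently (clip negatives, normalize) and sample each
    attribute from it. Real engines fit a single joint model instead."""
    synth = {}
    for noisy, sigma, attrs in measurements:
        if len(attrs) != 1:
            continue  # this toy version uses one-way marginals only
        probs = np.clip(noisy, 0.0, None)
        probs = probs / probs.sum()
        synth[attrs[0]] = rng.choice(len(probs), size=n, p=probs)
    return synth

measurements = [(np.array([5200.0, 4800.0]), 50.0, ('sex',)),
                (np.array([7500.0, 2500.0]), 50.0, ('income>50K',))]
synth = generate(measurements, n=10000)
print({a: np.bincount(v) / 10000 for a, v in synth.items()})
```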
To generate synthetic data, we have to simply instantiate one of the inference engines imported. In the code snippet above, we use the FactoredInference engine, which corresponds to the PGM method. The other inference engines share the same interface, and can be used instead if desired.
Remark 4: By utilizing the inference engines implemented in MBI, end-to-end synthetic data mechanisms can be written with remarkably little code. This simple example required less than 25 lines of code, and more complex mechanisms can usually be written in a single file with less than 200 lines of code. As a result, future research can focus on the measurement selection subproblem, and new ideas can more rapidly be evaluated and iterated on.
We evaluated the quality of the generated synthetic data by measuring the error on the measured marginals. Interestingly, the synthetic data has lower error than the noisy marginals, with reductions in error of up to 30% for the larger marginals, and around 3% for the smaller ones.
Remark 5: It is not surprising that the synthetic data enjoys lower error than the noisy marginals. Problem \ref{eq1} can be seen as a projection problem, and there is substantial theoretical [NTZ12] and empirical [LWK15, AAGK+19] evidence that solving this problem reduces error. Intuitively, the benefit arises due to the inconsistencies in the noisy observations that are resolved through the optimization procedure.
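A toy illustration of this projection effect, with hypothetical numbers: if the same count is observed twice with independent Gaussian noise (say, once directly and once implicitly through another marginal that sums to it), the least-squares-consistent estimate is their average, which cuts the root-mean-squared error by a factor of \( \sqrt{2} \).

```python
# Hypothetical numbers, illustrating why resolving inconsistencies via
# least-squares projection reduces error: averaging two independent noisy
# views of the same count halves the variance.
import numpy as np

rng = np.random.default_rng(1)
sigma, trials, true_count = 50.0, 20_000, 1000.0

direct = true_count + rng.normal(0, sigma, trials)    # the noisy marginal itself
indirect = true_count + rng.normal(0, sigma, trials)  # same count via another view
projected = (direct + indirect) / 2                   # least-squares reconciliation

rmse_noisy = np.sqrt(np.mean((direct - true_count) ** 2))
rmse_proj = np.sqrt(np.mean((projected - true_count) ** 2))
```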
We can also use the synthetic data to estimate marginals we didn’t measure with the Gaussian mechanism. These estimates may or may not be accurate, depending on the data and the marginal being estimated. For example, the error on the (sex, income>50K) marginal is around 0.02, while the error on the (education-num, occupation) marginal is about 0.5.
Remark 6: The fact that the synthetic data is not accurate for some marginals is not a limitation of the method used for Generate, but rather an artifact of what marginals were selected. Thus, it is clear that selecting the right marginals to measure plays a crucial role in the quality of the synthetic data. This is an important open problem that will be the topic of a future blog post.
In this blog post, we focused on the Generate step of the select-measure-generate paradigm. In the next blog post in this series, we will focus on state-of-the-art approaches to the Select subproblem. If you have any comments, questions, or remarks, please feel free to share them in the comments section below. If you would like to try generating synthetic data with MBI, check out this jupyter notebook on Google Colab!
The Generate step is a post-processing of already privatized noisy marginals, and therefore the privacy analysis only needs to reason about the first two steps.
Here we assume that \( \mathcal{M} \) is a mechanism with a discrete output space. In practice, this is always the case because any mechanism implemented on a finite computer must have a discrete output space. For continuous output spaces, interpret the objective function as a log density rather than a log probability.
For example, this worst-case behavior is realized if all 2-way marginals are measured. While this can be seen as a limitation of PGM, it is known that generating synthetic data that preserves all 2-way marginals is computationally hard in the worst case.
This idea was refined into RAP^{softmax} in follow-up work, which overcomes the latter issue, but does not resolve the non-convexity issue.
These approximations were all developed concurrently, and systematic empirical comparisons between them (and PGM) have not been done to date. Some experimental comparisons can be found in [LVW21] and [MPSM21].
I (Prashant Nalini Vasudevan) have a postdoctoral position available at the Department of Computer Science at NUS. I am looking for someone who works on the foundations of cryptography, information-theoretic cryptography, and/or related areas of theoretical computer science. See the website below for details, and feel free to email me if you have questions.
Website: https://careers.nus.edu.sg/NUS/job/Kent-Ridge-Research-Fellow%2C-Theory-of-Cryptography-Kent/7003544/
Email: prashant@comp.nus.edu.sg
The Theory Fest workshops committee is soliciting proposals for workshops. Workshops will be held during the STOC conference week, June 20-24, 2022. The (updated) deadline for submitting proposals is Feb 15, 2022. Details on how to submit a proposal can be found at http://acm-stoc.org/stoc2022/callforworkshops.html.
The screenwriter Aaron Sorkin wrote an article on prioritizing "Truth over Accuracy". He tells stories from his movies The Social Network and Being the Ricardos, of where he moves away from accuracy to get to the truth of a situation.
My friend and teacher, the late William Goldman, said of his Academy Award-winning screenplay for All the President's Men, "If I'm telling the true story of the fall of the President of the United States, the last thing I'm going to do is make anything up." I understand what he meant in context, but the fact is, as soon as he wrote "FADE IN," he'd committed to making things up. People don't speak in dialogue, and their lives don't play out in a series of scenes that form a narrative. Dramatists do that. They prioritize truth over accuracy. Paintings over photographs.
As scientists we focus on accuracy, as we should in our scientific publications. However, being fully accurate can distract from the "truth", the underlying message you want to convey, particularly in the title, abstract and introduction of our papers.
Even more so when we promote our research to the public. A science writer once lamented to me that scientists would focus too much on the full accuracy of the science and the names behind it, even though neither serves the reader well.
This reminds me of the recent Netflix movie Don't Look Up, which satirizes scientists trying to communicate an end-of-the-world event to an untrusting society. I wish it were a better movie, but it is still worth watching just to see Leo DiCaprio and Jennifer Lawrence play scientists frustrated with their inability to communicate a true existential crisis to the government and the general public.
So how should we as scientists frame our messaging to get people on board, particularly when we say things they don't want to hear? Most importantly, how do scientists regain trust in a world where trust is in short supply? Perhaps we should paint more and photograph less.
The list doesn’t destroy culture; it creates it. Wherever you look in cultural history, you will find lists—Umberto Eco
Luca Trevisan, Stefan Schmid, James Lee, Scott Aaronson, Michael Mitzenmacher, Omer Reingold, Lance Fortnow, David Eppstein are some of the top bloggers in theory. That is not “in theory” but “in the area of theory”.
This led us to spend some time while watching NFL football to put together a list of theory blogs for computer science. We insist that they are reasonably current. Our rule is: They must have some posts in 2021.
Our List
Here is the list:
Open Problems
We would like to get some feedback. First, any typos in the above? Second, any blogs we should have included? Third, any we should have left out because they do not satisfy our 2021 rule?
In the past few months, I’ve twice injured the same ankle while playing with my kids. This, perhaps combined with covid, led me to several indisputable realizations:
Hence today’s post. I’m feeling a strong compulsion to write an essay, or possibly even a book, surveying and critically evaluating a century of ideas about the following question:
Q: Why should the universe have been quantum-mechanical?
If you want, you can divide Q into two subquestions:
Q1: Why didn’t God just make the universe classical and be done with it? What would’ve been wrong with that choice?
Q2: Assuming classical physics wasn’t good enough for whatever reason, why this specific alternative? Why the complex-valued amplitudes? Why unitary transformations? Why the Born rule? Why the tensor product?
Despite its greater specificity, Q2 is ironically the question that I feel we have a better handle on. I could spend half a semester teaching theorems that admittedly don’t answer Q2, as satisfyingly as Einstein answered the question “why the Lorentz transformations?,” but that at least render this particular set of mathematical choices (the 2-norm, the Born Rule, complex numbers, etc.) orders-of-magnitude less surprising than one might’ve thought they were a priori. Q1 therefore stands, to me at least, as the more mysterious of the two questions.
So, I want to write something about the space of credible answers to Q, and especially Q1, that humans can currently conceive. I want to do this for my own sake as much as for others’. I want to do it because I regard Q as one of the biggest questions ever asked, for which it seems plausible to me that there’s simply an answer that most experts would accept as valid once they saw it, but for which no such answer is known. And also because, besides having spent 25 years working in quantum information, I have the following qualifications for the job:
The purpose of this post is to invite you to share your own answers to Q in the comments section. Before I embark on my survey project, I’d better know if there are promising ideas that I’ve missed, and this blog seems like as good a place as any to crowdsource the job.
Any answer is welcome, no matter how wild or speculative, so long as it honestly grapples with the actual nature of QM. To illustrate, nothing along the lines of “the universe is quantum because it needs to be holistic, interconnected, full of surprises, etc. etc.” will cut it, since such answers leave utterly unexplained why the world wasn’t simply endowed with those properties directly, rather than specifically via generalizing the rules of probability to allow interference and noncommuting observables.
Relatedly, whatever “design goal” you propose for the laws of physics, if the goal is satisfied by QM, but satisfied even better by theories that provide even more power than QM does—for instance, superluminal signalling, or violations of Tsirelson’s bound, or the efficient solution of NP-complete problems—then your explanation is out. This is a remarkably strong constraint.
Oh, needless to say, don’t try my patience with anything about the uncertainty principle being due to floating-point errors or rendering bugs, or anything else that relies on a travesty of QM lifted from a popular article or meme!
OK, maybe four more comments to enable a more productive discussion, before I shut up and turn things over to you:
We invite applications for the position of Chair of the Department of Computer Science as a tenured Associate Professor or tenured Professor effective on or before September 1, 2022. Candidates should have a Ph.D. in the field of Computer Science.
Website: https://academicjobsonline.org/ajo/jobs/21010
Email: csc-chair-search@uvic.ca
Colonel Stok: Do you play chess?
Harry Palmer: Yes, but I prefer a game with a better chance of cheating.—Funeral in Berlin
Ken Regan is well known to us all, and is the co-author of this blog. He is in the Department of Computer Science and Engineering, University at Buffalo (SUNY)–as part of the theory group. He is also an International Master from the World Chess Federation (FIDE).
Ken’s Trip
Ken has used his ability in theory with his expertise in chess to study how to detect cheating in chess. Unfortunately people do currently cheat in chess, and so detecting them is an important problem for FIDE. Ken has just spent some time in Bologna, Italy at the International Chess Federation Fair Play Commission’s meeting. The group’s goal is: There is an increasing demand for fair play experts during chess events, and we want to provide the organizers with professionals who know how to collect evidence, apply law, use the state-of-the-art detection tools. And do it in the way that players are protected and public perception of how important the fair play is improved.
Ken is on his way back from Italy and will soon explain in detail the latest about cheating in chess.
Cheating In The Past
In the past cheating at chess started with machines like the famous Turk of 1770. This was a machine that claimed to play chess without human help. Even just playing legal moves would have been impressive, but the Turk played strong chess.
The secret of the Turk was a hidden human chess player who made its moves from inside the machine. The audience was allowed to examine the machine by opening and looking at parts of it, but only one part at a time. The trick was that the hidden player could shift from one part of the Turk to another: at any moment some parts were visible while some part stayed out of view, and the player occupied whichever part that was. This fooled the audience and made it seem “magical” that the mechanical system could play chess.
Cheating Now and in the Future
Cheating at chess became interesting again when computer programs started to play strong chess. The fundamental insight was that once programs could play master-level chess, players could cheat and play at that level. The idea was simple: a player would not look at the board and select the next move themselves. They would instead ask the program what move they should make, and then play that move.
Of course, a serious issue was how the player could ask a program for the next move. If the game was played online, then the player could simply use their laptop to run the program. If the game was played in some public place, then the player might have to work harder to consult the program, but could perhaps still do so without being detected.
Technology has been used by chess cheaters in several ways. The most common way is to use a chess program while playing chess remotely, such as on the Internet or in correspondence chess. Rather than play the game directly, the cheater simply inputs the moves so far into the program and follows its suggestions, essentially letting the program play for them. Electronic communication with an accomplice during face-to-face competitive chess is a similar type of cheating; the accomplice can either be using a computer program or else simply be a much better player than their associate.
Open Problems
The main open problem is still: How well can cheating be detected? The basic idea is simple: Suppose that the player named Carol plays a game in some tournament. We wish to determine whether Carol played her own moves or whether she used moves of some program. How can we do this? There are several issues that make this hard.
Ken will update us on the latest views of these and other issues.
Is it an irony that Lipton's 1000th post and 75th bday are close together? No. It's a coincidence. People use irony/paradox/coincidence interchangeably. Hearing people make that mistake makes me literally want to strangle them.
The community celebrated this milestone by having talks on zoom in Lipton's honor. The blog post by Ken Regan that announced the event and has a list of speakers is here. The talks were recorded so they should be available soon. YEAH KEN for organizing the event! We may one day be celebrating his 2000th blog post/Xth bday.
I will celebrate this milestone by writing on how Lipton and his work have inspired and enlightened me.
1) My talk at the Lipton zoom-day-of-talks was on the Chandra-Furst-Lipton (1983) paper (see here) that sparked my interest in Ramsey Theory, led to a paper I wrote that improved their upper and lower bounds, and led to an educational open problem that I posted on this blog, which was answered. There is still more to do. An expanded version of the slide talk I gave on the zoom-day is here. (Their paper also got me interested in Communication complexity.)
2) I read the De Millo-Lipton-Perlis (1979) paper (see here) my first year in graduate school and found it very enlightening. NOT about program verification, which I did not know much about, but about how mathematics really works. As an ugrad I was very much into THEOREM-PROOF-THEOREM-PROOF as the basis for truth. This is wrongheaded for two reasons: (1) I did not see the value of intuition, and (2) I did not realize that the PROOF is not the END of the story, but the BEGINNING of a process of checking it- many people over time have to check a result. DLP woke me up to point (2) and (to a lesser extent) point (1). A scary thought: most results in math, once published, are never looked at again. So there could be errors in the math literature. However, the important results DO get looked at quite carefully. Even so, I worry that an important result will depend on one that has not been looked at much...Anyway, a link to a blog post about a symposium about DLP is here.
3) The Karp-Lipton theorem is: if SAT has poly-sized circuits then PH collapses (see here). It connects uniform and non-uniform complexity. This impressed me but also made me think about IF-THEN statements. In this case something we don't think is true implies something else we don't think is true. So--- do we know something? Yes! The result has been used to get results like
If GI is NPC then PH collapses.
This is evidence that GI is not NPC.
4) Lipton originally blogged by himself and a blog book came out of that. I reviewed it in this column. Later it became the Lipton-Regan blog, which also gave rise to a book, which I reviewed here. Both of these books inspired my blog book. This is a shout-out to BOTH Lipton AND Regan.
5) Lipton either thinks P=NP or pretends to since he wants people to NOT all think the same thing. Perhaps someone will prove P NE NP while trying to prove P=NP. Like in The Hitchhiker's Guide to the Galaxy where they say that to fly, you throw yourself on the ground and miss. I took Lipton's advice in another context: While trying to prove that there IS a protocol for 11 muffins, 5 students where everyone gets 11/5 and the smallest piece is 11/25, I wrote down what such a protocol would have to satisfy (I was sincerely trying to find such a protocol) and ended up proving that you could not do better than 13/30 (for which I already had a protocol). Reminds me of a quote attributed to Erdos: when trying to prove X, spend half your time trying to prove X and half trying to prove NOT(X).
6) Lipton had a blog post (probably also a paper someplace) about using Ramsey Theory as the basis for a proof system (see here). That inspired me to propose a potential randomized n^{log n} algorithm for the CLIQUE-GAP problem (see here). The comments showed why the idea could not work-- no surprise, as my idea would have led to NP contained in RTIME(n^{log n}). Still, it was fun to think about and I learned things in the effort.
The Faculty of Computer Science of HSE University (Moscow, Russia) accepts applications for postdoctoral positions in the following research areas: theoretical computer science, theoretical foundation of ML, interpretable knowledge discovery and explainable AI, complex systems and applied mathematics, formal concept analysis, process mining. The application deadline is January 31, 2022.
Website: https://cs.hse.ru/en/news/554897133.html
Email: fellowship@hse.ru
Mathematics is only a systematic effort of solving puzzles posed by nature—Shakuntala Devi.
source—note MoMath cap
Peter Winkler is featured in the current issue of the New Yorker magazine. Peter is a famous Dartmouth mathematics professor. He is an expert on math puzzles, both for making them and solving them—see his book, simply titled Mathematical Puzzles.
A recent evening of mathematical dinner theater hosted by Peter is the subject of the article, which is by Dan Rockmore, also of the Dartmouth math department. This event was co-coordinated by Cindy Lawrence, who is executive director of the National Museum of Mathematics:
source (malfunctioned for us)
The event was attended by a group of investors, economists, and teachers in a suitable alcove of a restaurant in Manhattan. Quoting Rockmore’s article:
Winkler began the evening’s program. The first course of math, delivered during the first course of dinner (a scattering of salads), was a statistics starter called Simpson’s paradox. The paradox explains how apparent biases in large samples can disappear in smaller ones. A famous example: For the University of California at Berkeley’s graduate programs in 1975, over all, men were admitted at a higher rate than women, but, program by program, women were admitted at a higher rate.
About Berkeley, see this first. We covered Simpson’s paradox here. The article goes on to mention the famous trick of how a biased coin can be made to simulate a fair coin by tossing it twice, and continues:
Winkler let loose with the last official mind bender, a gambling thought experiment involving a fictitious couple named Alice and Bob, who are famous in math circles. Each of them has a biased coin—fifty-one-per-cent chance of heads, forty-nine-per-cent chance of tails. They each start with a hundred dollars, flipping the coin and betting against the bank on the outcome. Alice calls heads every time; Bob calls tails. The puzzle: Given that they both go broke, which one is more likely to have gone broke first?
The article relates that most diners guessed Bob, but the correct answer is Alice. Former Times math and science columnist John Tierney, a serious enough math fan to have remarked once that the term “recreational mathematics” is a self-contradiction, reasoned it out: “The longer Alice plays, the less likely she is to go broke.” We finish Tierney’s intent by contraposing: hence, given that she too went broke, it is more likely that she played for less time.
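The biased-coin trick mentioned above is von Neumann's classic scheme, and it is short enough to sketch (toy code of ours, not from the article): toss the biased coin twice; heads-then-tails counts as heads, tails-then-heads as tails, and anything else means start over. The two accepted outcomes each have probability \( p(1-p) \), so the output is exactly fair.

```python
# von Neumann's trick for extracting a fair bit from a biased coin.
import random

def fair_flip(biased, p):
    """Toss the biased coin twice; HT -> 'H', TH -> 'T', else retry."""
    while True:
        a, b = biased(p), biased(p)
        if a != b:
            return a  # the two mixed outcomes are equally likely

def coin(p):
    """A p-biased coin."""
    return 'H' if random.random() < p else 'T'

random.seed(0)
n = 100_000
heads = sum(fair_flip(coin, 0.8) == 'H' for _ in range(n))
# heads / n is close to 0.5 despite the 80/20 coin
```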
Peter is a long time friend of mine. See him in action here. He is a delight to talk to about many things—just about any part of math.
Peter has a paper with the title, “Seven Puzzles You Think You Must Not Have Heard Correctly.” I had already started writing this post when I looked up this paper to get a few more puzzle examples. I found this one:
#4 Unwanted Expansion:
Suppose you have an algebraic expression involving variables, addition, multiplication, and parentheses. You repeatedly attempt to expand it using the distributive law. How do you know that the expression doesn’t continue to expand forever? Comment: Note that applying the distributive law to, say, the outer product in \( \bigl((a+b)(c+d)\bigr)(e+f) \) yields \( \bigl((a+b)(c+d)\bigr)e + \bigl((a+b)(c+d)\bigr)f \), which has more parentheses than before.
I thought about this for a little. Then I looked up the answer.
One can analyze the expression in terms of depth of trees, but there’s an easier way: set all the variables equal to 2. The point of the distributive law is that its application doesn’t change the value of the expression. The value of the initial expression limits the size of anything you can get from it by expansion.
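The argument is easy to machine-check on a toy expression (an illustrative sketch of ours, with every variable set to 2 as in the proof): distribution never changes the value, each application strictly increases the number of leaves, and the value bounds the number of leaves, so expansion must stop.

```python
# Toy machine check of the "set all variables to 2" argument. Expressions are
# nested tuples: 'x' is a variable (valued 2), ('+', a, b) and ('*', a, b).
def value(e):
    if e == 'x':
        return 2
    op, a, b = e
    return value(a) + value(b) if op == '+' else value(a) * value(b)

def leaves(e):
    return 1 if e == 'x' else leaves(e[1]) + leaves(e[2])

def distribute_once(e):
    """Apply the distributive law at the first place it fits, or return None."""
    if e == 'x':
        return None
    op, a, b = e
    if op == '*':
        if isinstance(b, tuple) and b[0] == '+':   # a*(c+d) -> a*c + a*d
            return ('+', ('*', a, b[1]), ('*', a, b[2]))
        if isinstance(a, tuple) and a[0] == '+':   # (c+d)*b -> c*b + d*b
            return ('+', ('*', a[1], b), ('*', a[2], b))
    for i in (1, 2):
        r = distribute_once(e[i])
        if r is not None:
            return e[:i] + (r,) + e[i + 1:]
    return None

e = ('*', ('*', ('+', 'x', 'x'), ('+', 'x', 'x')), 'x')   # ((x+x)(x+x))x
v = value(e)   # 32, and the distributive law never changes it
steps = 0
while True:
    assert leaves(e) <= value(e) == v   # the value bounds the leaf count
    nxt = distribute_once(e)
    if nxt is None:
        break                           # fully expanded: expansion stopped
    assert leaves(nxt) > leaves(e)      # each step strictly adds leaves,
    e, steps = nxt, steps + 1           # so at most v steps are possible
```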
Peter then says: This proof of stopping is due to Dick Lipton. I must say that surprised me. I am getting old. Oops.
Let’s end with a fun puzzle about the dictionary that is also mentioned in the New Yorker article.
If each number between one and ten billion was written down in English, which odd number would appear first alphabetically?
Given that spaces do not count, here is the answer. The last part changes because the number must be odd. If spaces do count—as in many dictionaries of phrases—then what is the answer?
Applications are invited for research associate positions in Algorithms and Complexity, funded by the European Research Council (ERC) starting grant “New Approaches to Counting and Sampling” (NACS), led by Dr. Heng Guo in the School of Informatics, University of Edinburgh.
Website: https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/job/3010/
Email: hguo at inf.ed.ac.uk
Exciting news, everyone! Jaan Tallinn, who many of you might recognize as a co-creator of Skype, tech enthusiast, and philanthropist, graciously invited me, along with a bunch of other nerds, to join the new Speculation Grants program of the Survival and Flourishing Fund (SFF). In plain language, that means that Jaan is giving me $200,000 to distribute to charitable organizations in any way I see fit—though ideally, my choices will have something to do with the survival and flourishing of our planet and civilization.
(If all goes well, this blog post will actually lead to a lot more than just $200,000 in donations, because it will inspire applications to SFF that can then be funded by other “Speculators” or by SFF’s usual process.)
Thinking about how to handle the responsibility of this amazing and unexpected gift, I decided that I couldn’t possibly improve on what Scott Alexander did with his personal grants program on Astral Codex Ten. Thus: I hereby invite the readers of Shtetl-Optimized to pitch registered charities (which might or might not be their own)—especially, charities that are relatively small, unknown, and unappreciated, yet that would resonate strongly with someone who thinks the way I do. Feel free to renominate (i.e., bring back to my attention) charities that were mentioned when I asked a similar question after winning $250,000 from the ACM Prize in Computing.
If you’re interested, there’s a two-step process this time:
Step 1 is to make your pitch to me, either by a comment on this post or by email to me, depending on whether you’d prefer the pitch to be public or private. Let’s set a deadline for this step of Thursday, January 27, 2022 (i.e., one week from now). Your pitch can be extremely short, like 1 paragraph, although I might ask you followup questions. After January 27, I’ll then take one of two actions in response: I’ll either
(a) commit a specified portion of my $200,000 to your charity, if the charity formally applies to SFF, and if the charity isn’t excluded for some unexpected reason (5 sexual harassment lawsuits against its founders or whatever), and if one of my fellow “Speculators” doesn’t fund your charity before I do … or else I’ll
(b) not commit, in which case your charity can still apply for funding from SFF! One of the other Speculators might fund it, or it might be funded by the “ordinary” SFF process.
Step 2, which cannot be skipped, is then to have your charity submit a formal application to SFF. The application form isn’t too bad. But if the charity isn’t your own, it would help enormously if you at least knew someone at the charity, so you could tell them to apply to SFF. Again, Step 2 can be taken regardless of the outcome of Step 1.
The one big rule is that anything you suggest has to be a registered, tax-exempt charity in either the US or the UK. I won’t be distributing funds myself, but only advising SFF how to do so, and this is SFF’s rule, not mine. So alas, no political advocacy groups and no individuals. Donating to groups outside the US and UK is apparently possible but difficult.
While I’m not putting any restrictions on the scope, let me list a few examples of areas of interest to me.
Two examples of areas that I don’t plan to focus on are:
Anyway, thanks so much to Jaan and to SFF for giving me this incredible opportunity, and I look forward to seeing what y’all come up with!
Note: Any other philanthropists who read this blog, and who’d like to add to the amount, are more than welcome to do so!
The department of Computer Science at the University of Vienna invites applications for a tenure-track assistant professor position in algorithms. The starting date is negotiable.
Website: https://univis.univie.ac.at/ausschreibungstellensuche/flow/bew_ausschreibung-flow?_flowExecutionKey=_c0BC6DB4A-6047-8758-BA4B-005110088D19_k7DD77CC1-13FE-009B-5AB9-3185BE466C0D&tid=88323.28
Email: monika.henzinger@univie.ac.at
Postdoctoral position available starting Fall’22. Want to do bold cutting-edge research on the foundations of Algorithms/AI/ML? Interested in living in the Washington, DC area? Shoot me your CV. Soft deadline: Jan 31.
Website: https://www.linkedin.com/jobs/view/2879807616/
Email: grigory@grigory.us
And a place to look for fallback info in case things go wrong
This is a deviation from the “blog invariants” that were maintained for 1,000 posts. We’ll be happy to have suggestions for style changes and general ideas going forward.
Here were the open links for the workshop. Rewritten afterward: Things went smoothly, despite even my “Heisenberg” feeling of uncertainty about whether I had altered the setup by testing the links beforehand. Dick’s talk for Princeton’s 2012 Turing centennial included among “Grand Challenges” getting AV equipment for lectures to work—well Zoom etc. are a decade-later evolution of that.
Again we thank all who spoke—note the enlarged list of speakers in the original post. We will do some combination of a curated video and followup blog post(s).
This is post 7*11*13 = 1001. The passcode—which should not be needed—remains 111317. Looking forward to a fun session, and thanks again to all speakers and other visitors.
[added update]
You can find plenty of online stories celebrating the release into the public domain of Winnie the Pooh (text, not film), Hemingway’s The Sun Also Rises, etc. But the situation for Ludwig Wittgenstein is more complicated. All his own writing is now public domain in life+70 countries, but the work of later editors might not be. Michele Lavazza explains (\(\mathbb{M}\), via).
Math rock (\(\mathbb{M}\), via), Jordi Lefebre.
Free abelian group (\(\mathbb{M}\)), now a Good Article at Wikipedia. This concept may seem trivial at first, wrapped in unnecessary formalism. But these things come up in a lot of seemingly-unrelated contexts: multiplication of rational numbers, addition of integer polynomials, lattices in the geometry of numbers, chains in homology theory, and divisors in algebraic geometry. I worried that its technicality might make it hard to pass GA but fortunately I got a sympathetic reviewer.
Quantum graph theory (\(\mathbb{M}\)): what happens when qubits have pairwise entanglements described by some graph? And which graphs can be transformed into each other by local quantum operations? Ken Regan explains.
Giant weeble in a Budapest plaza (\(\mathbb{M}\), via): it wobbles but it won’t fall down. But unlike the childhood toy, it does this despite being of uniform density, because of its special shape, rather than by having a much denser weight near one end. It’s theorized that tortoises have evolved a similar shape so that they can automatically right themselves. For more mathematical details, see Gömböc on Wikipedia.
Dictionary of mathematical eponymy: The Laves Graph (\(\mathbb{M}\)). I did a lot of work in 2014 on this topic in Wikipedia, but it evidently needs more editing for accessibility. I did not know the part about Laves being sent to the army because the Nazis viewed him as sympathetic to Jews, and then pulled into a secret project to develop unobtanium for Göring, guarded by an alchemist who helpfully insisted on adding lizard bones to the alloys and dangling crystal spheres over them.
Our quarter started this week, online through Zoom, accessed in Canvas (\(\mathbb{M}\)). To set it up I have to:
Someone else in the discussion said that to manage their online classes they have to go through seven different web sites; my count is not nearly that high. Whatever happened to the science of user-friendly interfaces?
An early look at the impact of the Chinese Academy of Sciences journals warning list (\(\mathbb{M}\)). The CAS list does not seem helpful as a tool for identifying specific journals as predatory, because it lists so few, and one has to filter for political motivation, but maybe it can still find patterns of bad publishing practices. I wasn’t particularly surprised to see MDPI and Hindawi high on their hit list, but the presence of a major IEEE journal (IEEE Access) was more interesting.
The new web site of the journal Ars Combinatoria (\(\mathbb{M}\)), after the old site got taken over by gambling spammers. Thanks to Jannis Harder for finding it for me by tracing through DNS history and trying its old IP address; it wasn’t coming up on searches and I wouldn’t have thought of doing that. Despite being found again, the journal may be in trouble: MathSciNet doesn’t list anything from them in 2021 and the new site says that as of December 2021 the entire editorial board has resigned (with no explanation other than “ask the publisher”).
Short video of curve-folded squishable paper helix by Richard Sweeney (\(\mathbb{M}\), via). Sweeney, previously.
The insidious corruption of open access publishers (\(\mathbb{M}\)): Igor Pak looks at predatory publishing and MDPI through a deep examination of MDPI’s journal Mathematics. Ultimately he concludes that it is not actually a predatory journal, but that doesn’t mean that he thinks it is doing good work. Instead, he favors diamond open-access and in particular arXiv-overlay journals.
Colorful geometric designs for pasta interact interestingly with the pasta’s intricate shapes (\(\mathbb{M}\)), by David Rivillo.
Typical incoherence from Quanta’s attempts at making mathematics accessible (\(\mathbb{M}\)): “the Riemann hypothesis says that the Riemann zeta function equals zero whenever the real part of \(s\) equals \(\tfrac12\)”. No. No, it doesn’t. Fortunately they fixed it quickly. The actual result in question concerns proving better-than-trivial growth rate on the critical line of \(L\)-functions; see “Bounds for standard \(L\)-functions”, Paul D. Nelson, arXiv:2109.15230.
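To spell out the quantifiers: the hypothesis places the nontrivial zeros on the critical line; it does not claim that the zeta function vanishes along that whole line. In symbols (a standard formulation, not the corrected Quanta wording):

```latex
% Riemann hypothesis: every nontrivial zero lies ON the critical line.
\[
  \zeta(s) = 0 \ \text{ and } \ 0 < \operatorname{Re}(s) < 1
  \quad\Longrightarrow\quad
  \operatorname{Re}(s) = \tfrac12 .
\]
% The misquote asserted the (false) converse:
% that Re(s) = 1/2 implies zeta(s) = 0.
```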
An unconstructable still life in Conway’s Game of Life (\(\mathbb{M}\)). The configuration shown in the link is stable (if its boundary is stabilized appropriately) but, if it appears in any pattern, it must always have been that way, in all predecessors of the pattern. As such, it answers the question “can every still life be constructed by gliders?” negatively.
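The linked configuration itself isn't reproduced here, but the defining property of a still life, a pattern that the Life rule maps to itself, is easy to check mechanically. A minimal Python sketch, using the familiar 2×2 block rather than the pattern from the link:

```python
from collections import Counter

def life_step(live):
    """One generation of Conway's Game of Life on a set of (x, y) live cells."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbours,
    # or 2 live neighbours and is currently alive.
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live)}

def is_still_life(live):
    """A still life is a pattern fixed by the Life rule."""
    return life_step(set(live)) == set(live)

block = {(0, 0), (0, 1), (1, 0), (1, 1)}   # the classic 2x2 block
blinker = {(0, 0), (0, 1), (0, 2)}         # oscillates, so not a still life
```

A stability check like this says nothing about constructibility by gliders, of course; that is the much harder property the linked configuration refutes.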
Pop-Up Geometry: The Mathematics behind Pop-Up Cards (\(\mathbb{M}\)), new book by Joe O’Rourke, to appear this year. The link goes to Joe’s web site for the book. I haven’t yet seen the actual book, but it looks likely to be very interesting.
Plus an open Zoom mini-workshop Monday 1/17, 4–5:30pm ET
From Rich DeMillo
Richard Lipton founded this blog 1,000 posts ago. He was not quite as young as in the photo. This blog is almost 13 years old but measures time by the count of posts. Time itself occurs only between posts. Unlike what we said here about counting birthdays, we are counting “1,000 posts ago” correctly—because the blog was created before the first post with the pages one can click on above.
Today we—including many from the community—also salute Dick’s 75th birthday.
The birthday was last September 6. A year ago, I thought the anniversaries could be made to coincide, but the time between posts flowed heavily. Much as a ship captain’s personal attributes become vested in the ship, I felt the birthday could best be hailed at the blog milestone. In this we take after the Queen of England. Elizabeth Windsor was born on April 21 (1926), but follows the centuries-old royal custom of celebrating the birthday in late spring.
It has been my great pleasure to know Dick for 35 years in-person and several years before that through his papers. His STOC 1980 paper with Rich DeMillo on the possible independence of P versus NP influenced me to focus on systems of logic as a graduate student. We talked most at STACS 1994 in Caen, France, and this led to a paper with four others in the next STACS. I gave an invited talk at Dick’s 2008 birthday celebration workshop—of which more below—and this began our closer association the next year. Dick and the blog are a fount of ideas, and if I say I wish I had time to develop them all, the answer is that our readers are always freely welcome to do so.
It is also my delight to have a slate of volunteer speakers for a mini-workshop this coming Monday (Martin Luther King Day, January 17) starting at 4pm Eastern Time (US). It will have the theme “Open Problems”—what else?—actually there is “else” as some of the brief talks will be personal notes or nuggets of work that influenced careers. It is open to the public—all readers are invited to attend on Monday via the following Zoom link. Update: things worked fine, and I have removed the link…
…except to note that the passcode I made was 111317. I intended the passcode to encode the three primes that multiply to 1,001, thinking of what comes next, but the three primes I actually used, 11, 13, and 17, multiply to 2,431, which is way far ahead; but see Lance Fortnow’s comment below. There is no “Waiting Room” and no restriction on when to come or go.
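The slip is easy to verify with a few lines of Python (trial division is plenty for numbers this small):

```python
def prime_factors(n):
    """Factor n by trial division; fine for small inputs like these."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

# 1,001 factors as 7 * 11 * 13, while the passcode's primes give 2,431.
```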
Here are the speakers who have volunteered so far, in alphabetical order, except that those in overseas time zones, beginning with India, will go first:
And possibly a few more. Among the open problems, some may be “Favorite Problems” and some “Formative Problems.” The latter means ones that can shape the field during the coming decade. They should be both important and pliable. The latter word is a pun: it means flexible, but we also mean something a beginning researcher can ply into a dissertation and early career. Computational complexity has many problems that seem to have become rigid, but its sideways growth shows pliable ones are out there.
One of this blog’s first posts in February, 2009, was on conveying the key idea to solve a problem in the time of an elevator pitch. The same goes for the time to state a good problem. The talks, including pauses and time for questions, will aim to fit within an hour, or 90 minutes at most.
I am also thankful to have the following well-wishes thus far from around the community, including several more fellow bloggers.
More may be added, and our readers are welcome to add more in the comments.
Andrea LaPaugh became Dean of the Faculty at Princeton and served until her recent retirement. She has some further words from that perspective going back to the beginning of her 38 years at Princeton.
Dick was a major force in the development of the CS department at Princeton. His work was instrumental in the creation of the department as separate from Electrical Engineering, in recruiting faculty to grow the department, and in acquiring resources. He arrived in 1980 and immediately recruited David Dobkin as a full professor and me as an assistant professor. He continued to have major involvement in recruiting, not only in theoretical computer science, but across the CS disciplines. He was never chair, but worked first with Bruce Arden, chair of the Department of Electrical Engineering and Computer Science, and then with Bob Sedgewick, the first chair of the Department of Computer Science (after Dick recruited him) to implement a vision of a strong and well-rounded CS department. Of course Dick was not alone in this mission. David Dobkin in particular was an important partner. But from my perspective, Dick was the spark.
David Dobkin, also writing from Princeton, goes back a few years further still to Yale:
In 1973, when I arrived at Yale as a new faculty member with a fresh PhD, Dick Lipton was the person in the next office, having arrived perhaps a week earlier. He came by to introduce himself. After a few formalities, we began to discuss research problems, and he asked me about NP-complete problems, which were all the rage at the time. The only problem I knew about was the knapsack problem, and so I described it.
This led to a discussion about integer programming (which neither of us had a good handle on) and then to a discussion about determining in which region of a subdivision of \(d\)-space (by a set of hyperplanes) a new point lies. Here again, neither Dick nor I had much background, and so we invented things as we went. I had this idea that we could understand things with visual help, and so we took one of my moving boxes as our universe and the backs of some tablets as planes and began to simulate how regions would be determined in 3D.
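Their actual data structure takes more than a few lines, but the underlying question, which cell of a hyperplane arrangement contains a query point, can be phrased via sign vectors. A naive Python sketch of the brute-force classification their paper set out to beat:

```python
def sign_vector(point, hyperplanes):
    """Classify a point against each hyperplane a . x = b by the sign
    of a . x - b.  Points lying off every hyperplane are in the same
    open cell of the arrangement exactly when their sign vectors agree."""
    def side(a, b):
        s = sum(ai * xi for ai, xi in zip(a, point)) - b
        return (s > 0) - (s < 0)   # -1, 0, or +1
    return tuple(side(a, b) for a, b in hyperplanes)

# The three coordinate planes of 3-space cut it into eight octants.
planes = [((1, 0, 0), 0), ((0, 1, 0), 0), ((0, 0, 1), 0)]
```

Testing a point against every hyperplane takes time linear in the number of planes per query; the Dobkin–Lipton result showed how to preprocess the arrangement so that queries are far faster.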
From this work emerged our first paper together, which was presented at STOC ’74. This paper, on planar subdivision searching, opened up a field and led to the long and enjoyable collaboration that followed.
So, congratulations Dick on having escaped the world of cardboard planes in a moving box and headed on to a remarkable career in the field.
Rich DeMillo sent the following from summer sunshine and open water:
The last time I missed Dick’s birthday was 2008. In 2007, Dan Boneh, Merrick Furst, Santosh Vempala, and I sent the following 60th birthday symposium invitation to twenty of Lipton’s closest friends and collaborators:
The Lipton-60 Symposium was finally held in April 2008. It was less the retrospective I had imagined than a mile marker in a career that continued to accelerate in unexpected ways. A few years later, he was elected to the American Academy of Arts and Sciences. A few months after that, he was announced as winner of the 2014 Knuth Prize. Even more memorable was the other thing that draws us together here today: the launch in 2009 of the GLL blog that he now writes with Ken.
I have long since strayed from the mainstream of theory and algorithms, but most of the work we did together has found its way into GLL essays that illuminate in startling new ways material I thought I understood thoroughly. Our paper on the nature of proof that now appears in Harry Lewis’s collection of 46 of the most important (in his view) papers in computer science made its first appearance on the blog in 2009 and has been a recurring character in the “GLL universe” ever since. Our probabilistic algorithm for polynomial zero testing also entered GLL in 2009. For years I harbored vague resentment that Jack Schwartz was often cited in the Schwartz-Zippel Lemma for a result that Dick and I had published two years before in 1979, but the exposition in GLL, Ken’s subsequent commentary, and years later, a reader’s comment showing that Daniel Erickson’s 1974 Caltech dissertation had scooped us all, laid that to rest.
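For readers who haven't met it, the lemma underpins a remarkably short algorithm: to test whether two polynomial expressions agree identically, evaluate both at random points, since a nonzero difference of total degree \(d\) vanishes at a uniformly random point of \(\{0,\dots,p-1\}^n\) with probability at most \(d/p\). A hedged Python sketch of the idea (not the 1979 paper's presentation):

```python
import random

def probably_identical(f, g, num_vars, degree_bound,
                       trials=20, p=2**61 - 1):
    """DeMillo-Lipton / Schwartz-Zippel identity test.  If f - g is a
    nonzero polynomial of total degree <= degree_bound, each random
    evaluation modulo p exposes the difference with probability at
    least 1 - degree_bound / p."""
    for _ in range(trials):
        point = [random.randrange(p) for _ in range(num_vars)]
        if f(*point) % p != g(*point) % p:
            return False          # a witness: definitely not identical
    return True                   # identical with overwhelming probability

# (x + y)^2 and x^2 + 2xy + y^2 agree as polynomials; x^2 + y^2 does not.
square = lambda x, y: (x + y) ** 2
expanded = lambda x, y: x * x + 2 * x * y + y * y
wrong = lambda x, y: x * x + y * y
```

Note the one-sided error: a mismatch at any point is conclusive, while agreement at all sampled points only makes identity overwhelmingly likely.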
We both had a fascination with the idea of algorithm correctness—incorrectness, actually—that turned into entire fields in software engineering and cryptography. Those were documented in an “Un-Birthday” post a year ago:
Rich and I started our work together over four decades ago. A central theme of our work was correctness. We were concerned that programs might not work as planned. At the time it was not obvious that this was a major theme of our joint work. But looking back now I can see that it was.
Also recently, I managed to coax Dick into the fraught waters of election security. I stood back to watch how flustered researchers would respond to his (brilliant but unworkable) reconstructions of what it means to vote in a public election.
I have been delaying my plans to remove myself from day-to-day management duties to found a new department of cybersecurity at Georgia Tech, and I have missed Dick at my side since his retirement. There was always the possibility when Dick was in the room that someone’s long-held (but wrong) assumption would fall apart under a trademarked but understated Lipton assault. Michael Rabin (who also contributed to the Lipton-60 symposium) was fond of telling Dick on such occasions, “You are a very dangerous person.”
With this intimate gathering on January 17 to commemorate GLL 1,000 and Dick Lipton’s 75th Birthday, Ken Regan has slyly kept alive the tradition of meetings that take place months after the anniversary of Dick’s birth. Times are different. Virtual meetings have replaced the casual travel and camaraderie of 2008. I cannot attend in any form on January 17 since I am at present on a ship bound for the Drake Passage and Patagonia, and I will lose Internet connectivity soon. I hope that explanation excuses this longer-than-asked-for expression of affection and admiration for Dick, my friend and collaborator for nearly 50 years.
A meta open-problem: Which was the most pliable problem at the time this blog was founded? The Unique Games conjecture strikes us as a perfect example, but you readers may have other examples from then and now.
[fixed product 11*13*17, added more greetings and speakers, removed workshop link afterward]
The Computer Science (CS) Department and the Electrical and Computer Engineering (ECE) Department at Stony Brook University invite applications for a tenure-track/tenured faculty position in Quantum Information Science and Engineering with an expected start date of Fall 2022. Exceptionally qualified candidates at any rank are invited to apply.
Website: https://www.cs.stonybrook.edu/about-us/career/empire
Email: recruit@cs.stonybrook.edu
UMD invites applications for faculty positions in quantum science and information. Areas to be considered include quantum computation, quantum simulation, quantum information processing, quantum sensing, and quantum networking. Research can be experimental, theoretical or computational in nature. Successful applicants will be expected to maintain active research programs and teach.
Website: https://ejobs.umd.edu/postings/91259
Email: gene@umd.edu
We’d like to encourage everyone to nominate outstanding papers in any area of theoretical computer science for the 2022 Gödel Prize.
In short, papers that first appeared since 2009 (in any form) and appeared in a refereed journal by 2021 are eligible. If you wish to nominate a paper, or might wish to do so, or if you want to suggest a paper informally, please contact the award committee chair, Samson Abramsky (s.abramsky@ucl.ac.uk). The deadline is February 28, 2022, but please contact the chair well in advance to coordinate efforts. For details, see https://sigact.org/prizes/g%C3%B6del.html
Guest post by Boaz Barak and Jelani Nelson
In a recent post, Lance Fortnow critiqued our open letter on the proposed revisions for the California Mathematics Framework (CMF). We disagree with Lance’s critique, and he has kindly allowed us to post our rebuttal here (thank you Lance!).
First, let us point out the aspects where we agree with both Lance and the authors of the CMF. Inequality in mathematical education, and in particular the obstacles faced by low-income students and students of color, is a huge problem in the US at large and California in particular. As a Black mathematician, this portion of the CMF’s introduction particularly resonated with me (Jelani):
Girls and Black and Brown children, notably, represent groups that more often receive messages that they are not capable of high-level mathematics, compared to their White and male counterparts (Shah & Leonardo, 2017). As early as preschool and kindergarten, research and policy documents use deficit-oriented labels to describe Black and Latinx and low-income children’s mathematical learning and position them as already behind their white and middle-class peers (NCSM & TODOS, 2016).
We agree with the observation that bias in the public education system can have a negative impact on students from underrepresented groups. Where we strongly part ways with the CMF, though, is in its conclusions about how to address this concern.
The CMF states that it is motivated by increasing equity in mathematics. However, if we read past the introduction to the actual details of the CMF revisions, we see that they suffer from fundamental flaws which, if implemented, we believe would exacerbate educational gaps, and in particular make it harder for low-income students and students of color to reach and succeed in college STEM.
You can read our detailed critique of the CMF, but the revisions we take issue with are:
Revisions 1 and 2 make it all but impossible for students who follow the recommended path to reach calculus (perhaps even pre-calculus) in the 12th grade. This means that such students will be at a disadvantage if they want to pursue STEM majors in college. And who will these students be? Since the CMF is only recommended, wealthier school districts are free to reject it, and some have already signalled that they will do so. Within districts that do adopt the recommendations, students with means are likely to take private Algebra I courses outside the curriculum (as has already happened in San Francisco) and reject the calculus-free “data science” pathway. Hence this pathway will amount to a lower-tier track by another name, and, worse than now, students will be tracked based on whether their family has the financial means to supplement the child’s public education with private coursework.
Notably, though the CMF aims to elevate data science, we’ve had several data science faculty at the university level express disapproval of the proposal by signing our opposition letter, including a founding faculty member of the Data Science Institute at UCSD, and several others who are directors of various undergraduate programs at their respective universities, including four who direct their universities' undergraduate data science programs (at Indiana University, Loyola University in Chicago, MIT, and the University of Wisconsin)!
One could say that while the framework may hurt low-income or students of color who want to pursue STEM in college, it might help other students who are not interested in STEM. However, interest in STEM majors is rapidly rising, and with good reasons: employment in math occupations is projected to grow much faster than other occupations. With the increasing centrality of technology and STEM to our society, we urgently need reforms that will diversify these professions rather than the other way around.
As a final note, Lance claimed that by rejecting the CMF, we are “defending the status quo”. This is not true. The CMF revisions are far from the “only game in town” for improving the status quo in mathematics education. In fact, unlike these largely untested proposals, there is a history of approaches that do work for teaching mathematics for under-served populations. We do not need to change the math itself, just invest in more support (including extracurricular support) for students from under-resourced communities. For example, Bob Moses’ Algebra Project has time and again taken the least successful students according to standardized exams, and turned them into a cohort that outperformed state averages in math. One of our letter’s contact people is Adrian Mims, an educator with 27 years of experience, whose dissertation was on "Improving African American Achievement in Geometry Honors" and who went on to found The Calculus Project, a non-profit organization creating a pathway for low-income students and students of color to succeed in advanced mathematics.
To close, a critique of the proposed CMF revision is not a defense of the status quo. Even if change is needed, not all change is good change, and our letter does make some recommendations on that front. One of these is a matter of process: if a goal is to best prepare Californian youth for majors in data science and STEM more broadly, and ultimately for careers in these fields, then involve college-level STEM educators and STEM professionals in the Curriculum Framework and Evaluation Criteria Committee.