Realizable Learning is All You Need

The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.


Introduction
The equivalence of realizable and agnostic learnability in Valiant's Probably Approximately Correct (PAC) model [1] is one of the best known results in learning theory, and numbers among its most surprising. Given a set X and a family of binary classifiers H, the result states that the ability to learn a classifier h ∈ H from examples of the form (x, h(x)) is in fact sufficient for something much stronger: given samples from any distribution D over X × {0, 1}, it is possible to learn the best approximation to D in H. This surprising equivalence stems from a classical result of Vapnik and Chervonenkis (VC) [2], and independently Blumer, Ehrenfeucht, Haussler, and Warmuth (BEHW) [3] and Haussler [4], who equate both the former model (known as realizable learning) and the latter model (known as agnostic learning) to a strong property of pairs (X, H) called uniform convergence. VC, BEHW, and Haussler's result was certainly a breakthrough in its own right, but its proof technique is too indirect to reveal any deeper connections between realizable and agnostic learning beyond the PAC setting. Further, recent years have seen both theory and practice shift away not only from this original formalization, but more generally from the "uniform convergence equals learnability" paradigm, often in favor of distributional or data-dependent assumptions like margin that are more applicable to the real world. The inability of VC, BEHW, and Haussler's proof technique to generalize to such scenarios raises a fundamental question: is the equivalence of realizable and agnostic learning a fundamental property of learnability, or simply a happy coincidence derived from the original PAC framework?
In the 30 years since these works, a mountain of evidence has amassed in favor of the former: almost every reasonable variant of learning shares some sort of similar equivalence. This includes a long list of popular settings such as regression [5], distribution-dependent learning [6], multi-class learning [7], robust learning [8], online learning [9], private learning [10,11], and partial learning [12,13]. What's more, the uniform convergence paradigm fails miserably in most of these models. In the distribution-dependent model, for instance, it is easy to build classes which are trivially learnable (even with one sample!) but completely fail to satisfy uniform convergence [6]. On the other hand, models such as private learning give well-known examples where uniform convergence fails to imply learnability [14]. In spite of this, we are really no closer today to a general understanding of this phenomenon than we were in the early 90s. Much like the proofs of Vapnik and Chervonenkis [2], Blumer, Ehrenfeucht, Haussler, and Warmuth [3], and Haussler [4], the above works often use indirect methods and tend to rely on powerful model-dependent assumptions.
In this work, we aim to offer a generic, unifying theory by way of the first direct reduction from agnostic to realizable learning. Unlike any previous work, our reduction is blackbox, relies on no additional assumptions, and, perhaps most importantly, is incredibly simple. In fact, the basic algorithm can be stated in three lines.

Algorithm 1 (Agnostic to Realizable Reduction). Given a realizable learner A for (X, H):
1. Draw an unlabeled sample S_U and a labeled sample S_L.
2. Run A over every possible labeling of S_U to build the set of hypotheses C(S_U) = {A(S_U, y) : y ∈ Y^{S_U}}.
3. Return the hypothesis in C(S_U) with lowest empirical error over S_L.
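As a concrete illustration (our own, not from the paper), the following Python sketch instantiates the three steps for threshold classifiers on [0, 1]. A consistent learner stands in for the realizable learner A; the function names and the toy noisy-threshold data are hypothetical choices for this sketch.

```python
import itertools
import random

random.seed(0)

# Hypothesis class: thresholds on [0, 1], with h_t(x) = 1 iff x >= t.
def predict(t, x):
    return int(x >= t)

def realizable_learner(xs, ys):
    """A consistent learner (a valid realizable learner for thresholds):
    return any threshold agreeing with the labeled points, else None."""
    for t in [0.0] + sorted(xs) + [2.0]:
        if all(predict(t, x) == y for x, y in zip(xs, ys)):
            return t
    return None

def reduction(S_U, S_L):
    # Step 2: run the realizable learner on every labeling of S_U.
    C = set()
    for ys in itertools.product([0, 1], repeat=len(S_U)):
        t = realizable_learner(S_U, ys)
        if t is not None:
            C.add(t)
    # Step 3: return the hypothesis in C(S_U) with lowest empirical
    # error over the labeled sample S_L.
    return min(C, key=lambda t: sum(predict(t, x) != y for x, y in S_L))

def draw(n, noise=0.1):
    # Agnostic data: true threshold 0.5, labels flipped w.p. `noise`.
    out = []
    for _ in range(n):
        x = random.random()
        y = predict(0.5, x)
        out.append((x, 1 - y if random.random() < noise else y))
    return out

S_U = [x for x, _ in draw(12)]   # Step 1: unlabeled and labeled samples
S_L = draw(500)
t_hat = reduction(S_U, S_L)
test_err = sum(predict(t_hat, x) != y for x, y in draw(2000)) / 2000
```

Despite the 10% label noise, the returned threshold lands near 0.5, since the cover C(S_U) built from the realizable learner contains a near-optimal threshold and the final ERM step selects it.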
This basic reduction simplifies and unifies classic results such as VC [2], BEHW [3], and Haussler's [4] distribution-free equivalence and Benedek and Itai's [6] analogous result in the distribution-dependent setting, with no loss in sample complexity. Moreover, because Algorithm 1 doesn't rely on model-dependent properties like uniform convergence, it extends to learning regimes without known characterizations. One such example is the notoriously difficult distribution-family model, in which the adversary is given a restricted family of distributions D along with the pair (X, H). While no characterization of learnability is known in this model, Algorithm 1 can still be used to show that the realizable and agnostic settings are equivalent.
Unfortunately, while Algorithm 1 does avoid any significant blowup in sample complexity, it is inherently computationally inefficient. In fact, this is necessary unless P = NP: there are many basic classes (e.g. halfspaces) which are easy to learn in the realizable model, but NP-hard to learn in the agnostic setting (see e.g. [16]). As a result, we focus in this work only on information-theoretic considerations, though building computationally efficient reductions under restricted settings remains a very interesting avenue of research.
At a technical level, the core of Algorithm 1 is a new equivalence between PAC learning and a type of randomized covering we call a non-uniform cover. In contrast to more classical notions, a non-uniform cover is a distribution over subsets of hypotheses that covers any fixed hypothesis in the class with high probability, but may fail to cover all hypotheses simultaneously. The connection between supervised learning and non-uniform covering is inherent in Algorithm 1, where Steps 1 and 2 turn the realizable learner A into a non-uniform cover C(S_U), and Step 3 uses the cover to perform agnostic learning. At a high level, this process works because the adversary does not see the randomness inherent to Steps 1 and 2, and therefore cannot detect or exploit which hypotheses in the class will fail to be well-estimated in the process.
In fact, before moving to the analysis of Algorithm 1, it is worth mentioning that this connection has much broader implications. By replacing Step 3 with learners for various other properties over finite classes (e.g. the exponential mechanism for privacy), it is possible to reduce many learning problems to the basic realizable setting. At a high level, this can be summarized by the following informal 'guiding principle':

Guiding Principle (Property Generalization). If there is a (sample-efficient) algorithm for property P over finite classes, then Algorithm 1 gives a (sample-efficient) learner with property P over any 'learnable' class.
We stress that the above is a guide, not a theorem, and indeed often requires modification for a given application (e.g. it is well known that private and realizable learning are not equivalent, but applying the principle can still result in a number of weaker variants of privacy). Nevertheless, there are many examples of property generalization in the literature (e.g. for malicious noise [17], nasty noise [18], or even unrelated properties such as privacy [19] and robustness [8]). Algorithm 1 provides a unified framework for such results and helps extend many beyond the standard PAC-model. This includes basic extensions such as general loss functions 5 and the distribution-family model, but also more involved modifications such as partial learning, robust learning, or even the statistical query model. Moreover, in some of these settings removing reliance on setting-specific assumptions like uniform convergence actually quantitatively improves the sample complexity as well; such is the case for semi-private learning, where we use this fact to achieve information-theoretically optimal unlabeled sample complexity for the first time.

The Basic Reduction
Since all of our results are derived from variants of Algorithm 1, it is instructive to start by considering its basic analysis in our simplest non-trivial setting: distribution-family classification. This framework captures learnability with arbitrary distributional assumptions, a well-studied relaxation of PAC learning in practice where worst-case distributional assumptions are often too strong, and encompasses both the distribution-free and distribution-dependent PAC settings. Unlike these models, however, the distribution-family setting has no known characterization of learnability: uniform convergence is not necessary [6] as in the former, and finite coverability is not sufficient [20] as in the latter. Indeed, it is plausible that no combinatorial characterization of this model exists at all, as it shares characteristics with EMX learnability, which was recently shown to be independent of the ZFC axioms of set theory [21]. As a result, we cannot hope to prove the equivalence of agnostic and realizable learning in this model by finding some common characterization.
With this in mind, let's define distribution-family learning a bit more formally. Let X be a set (called the instance space), Y = {0, 1} the set of binary labels, D a family of distributions over X, and H = {h : X → Y} a family of binary classifiers. A tuple (D, X, H) is said to be realizably learnable if there exists an algorithm A (which may be deterministic or randomized; this distinction has no effect on any of the arguments in this work) and a function n(ε, δ) such that for every ε, δ > 0, every choice of distribution D ∈ D, and every hypothesis h ∈ H, A outputs a good classifier with high probability on samples of size n = n(ε, δ):

Pr_{S∼D^n}[err_{D×h}(A(S, h(S))) ≤ ε] ≥ 1 − δ,

where err_{D×h}(h′) = Pr_{x∼D}[h′(x) ≠ h(x)] is commonly called the error or risk of h′. Likewise, a tuple (D, X, H) is said to be agnostically learnable if there exists an algorithm A which for every distribution D over X × Y whose marginal D_X ∈ D outputs h′ close to the best hypothesis in H with probability 1 − δ:

Pr_{S∼D^n}[err_D(h′) ≤ min_{h∈H} err_D(h) + ε] ≥ 1 − δ,

where err_D(h) = Pr_{(x,y)∼D}[h(x) ≠ y]. With this in mind, we can now state the most basic application of Algorithm 1: the equivalence of agnostic and realizable learning for distribution-family classification.

Theorem 2.1 (Agnostic to Realizable Reduction). If (D, X, H) is realizably learnable with sample complexity n(ε, δ), then it is agnostically learnable with n(ε/2, δ/2) unlabeled samples and O((n(ε/2, δ/2) + log(1/δ))/ε²) labeled samples.
Along with its novelty in the distribution-family setting (where no such equivalence was known), it is worth noting that in the distribution-free setting, Theorem 2.1 actually recovers the same sample complexity bound as standard analysis of uniform convergence. 7 We also note that while unlabeled sample complexity is not usually considered separately from labeled complexity in the PAC setting, this will become a useful distinction in semi-supervised extensions considered later in the work. As such, it is instructive to keep the complexities separate for the time being.
With this out of the way, let's prove Theorem 2.1. The analysis breaks naturally into two parts, corresponding respectively to Step 2 and Step 3 of Algorithm 1. In the first part, we'll show that C(S_U), the set of outputs corresponding to running the realizable learner A across all possible labelings of the unlabeled sample S_U, is in some sense a "good approximation" of the class H. More formally, the crucial observation is that for any choice of the adversary's distribution, C(S_U) will (almost) always contain a hypothesis close to the optimal solution.

Claim 2.2. For any distribution D over X × Y whose marginal D_X ∈ D, with probability 1 − δ/2, there exists h′ ∈ C(S_U) which is within ε/2 of the optimal risk:

err_D(h′) ≤ min_{h∈H} err_D(h) + ε/2.

Once we have this claim, the second step is to show that Step 3, an empirical risk minimization process on C(S_U), gives the desired agnostic learner. This actually follows from standard arguments. In particular, given a hypothesis h ∈ C(S_U), let

êrr_{S_L}(h) = (1/|S_L|) Σ_{(x,y)∈S_L} 1[h(x) ≠ y]

denote its empirical risk with respect to S_L. Since C(S_U) is finite, a standard Chernoff + union bound gives that with probability at least 1 − δ/2, the empirical risk of every hypothesis in C(S_U) with respect to S_L is close to its true risk. Then as long as S_L is sufficiently large, empirical risk minimization returns a solution with at most OPT + ε error with high probability (we'll formalize this in a moment).
It remains to prove Claim 2.2. The key observation lies in an equivalence between realizable PAC learning and a weak type of randomized covering: for any fixed h ∈ H, C(S_U) contains a hypothesis close to h with high probability.

Lemma 2.3. Let S_U be an unlabeled sample of size n(ε/2, δ/2) drawn from D_X ∈ D. Then for any fixed h ∈ H, with probability at least 1 − δ/2,

Pr_{x∼D_X}[h′(x) ≠ h(x)] ≤ ε/2,

where h′ = A(S_U, h(S_U)). This is exactly the realizable learning guarantee applied to the pair (D_X, h). Since C(S_U) contains A(S_U, h(S_U)) for every h ∈ H by definition, the result follows.
More generally, we call such objects non-uniform covers.

Definition 2.4 (Non-uniform Cover (Informal Definition 7.2)). Let (X, H) be a class over label space Y, D a marginal distribution over X, and C a random variable over the power set P(H). We call C a non-uniform (ε, δ)-cover of H with respect to D if for every h ∈ H:

Pr_C[∃h′ ∈ C : Pr_{x∼D}[h′(x) ≠ h(x)] ≤ ε] ≥ 1 − δ.

Note that Lemma 2.3 (and non-uniform covering in general) does not imply that C(S_U) contains hypotheses close to every h ∈ H simultaneously. This stronger object is called a uniform cover and takes provably more samples to construct (see Appendix G). In our case, a non-uniform cover is sufficient: since the guarantee holds for every fixed h ∈ H, it holds in particular for the optimal hypothesis h_OPT, so C(S_U) contains some h′ within ε/2 of optimal. Let's now formalize these ideas and put everything together to prove Theorem 2.1.

7. Though it should be noted that the additional log(1/ε) factor can be removed by a more complicated chaining argument [15].
Proof of Theorem 2.1. Let D be the adversary's distribution over X × Y, and let h_OPT ∈ H be a hypothesis achieving the optimal error OPT. By Lemma 2.3, with probability 1 − δ/2, C(S_U) contains a hypothesis h′ such that

Pr_{x∼D_X}[h′(x) ≠ h_OPT(x)] ≤ ε/2.

This implies Claim 2.2 (that C(S_U) contains a hypothesis with error at most OPT + ε/2), since

err_D(h′) ≤ err_D(h_OPT) + Pr_{x∼D_X}[h′(x) ≠ h_OPT(x)] ≤ OPT + ε/2.

We can now use standard empirical risk minimization bounds on C(S_U) to find a hypothesis with error at most OPT + ε. Chernoff and union bounds imply that with probability at least 1 − δ/2, the empirical risk of every hypothesis in C(S_U) on a sample of size O(log(|C(S_U)|/δ)/ε²) is at most ε/4 away from its true error.
Since h′ has error at most OPT + ε/2, its empirical risk is at most OPT + 3ε/4, and by the above guarantee any hypothesis in C(S_U) with empirical risk at most OPT + 3ε/4 has true error at most OPT + ε.
Putting everything together, we have that with probability 1 − δ over the entire process, the empirical risk minimizer of C(S_U) has error at most OPT + ε as desired. The sample complexity bounds follow from noting that |C(S_U)| is at most 2^{n(ε/2, δ/2)} in general, and at most (e · n(ε/2, δ/2)/d)^d if the class has VC dimension d. The sample complexity bound for the latter case then follows by plugging in the standard bound for distribution-free classification: n(ε, δ) = O((d log(1/ε) + log(1/δ))/ε).
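A quick numerical sanity check of the distinction drawn in Definition 2.4: a random set containing each of N pairwise far-apart hypotheses independently with probability 1 − δ is a perfectly good non-uniform cover, yet for large N it almost never covers all hypotheses at once. The simulation below is purely our own illustration (hypotheses are abstract indices, and "covered" simply means "included").

```python
import random

random.seed(1)

N, delta, trials = 100, 0.05, 2000

def draw_cover():
    # C: include each of N (pairwise far-apart) hypotheses independently
    # with probability 1 - delta -- a non-uniform (0, delta)-cover.
    return {i for i in range(N) if random.random() < 1 - delta}

# Coverage of one fixed hypothesis vs. all hypotheses simultaneously.
frac_fixed = sum(0 in draw_cover() for _ in range(trials)) / trials
frac_uniform = sum(len(draw_cover()) == N for _ in range(trials)) / trials
```

Each fixed hypothesis is covered about 95% of the time, while a single draw covers all 100 hypotheses with probability roughly 0.95^100 ≈ 0.006, matching the claim that uniform covering is a genuinely stronger requirement.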

Property Generalization Beyond Binary Classification
While simplifying and generalizing classic results on distribution-free [3,4] and distribution-dependent [6] learning is nice in and of itself, the real benefit of Algorithm 1 lies in its ability to adapt across a huge range of models. In this section, we'll give a high-level overview of some extensions to popular learning settings beyond the PAC model, and look at how Algorithm 1 extends to properties beyond agnostic learning (e.g. privacy, stability, and malicious noise). Detailed discussion and proofs of these results are given in the main body and appendices.

Generalizing Labels and Loss
We'll start by considering one of the most basic and important modifications in practice: generalized label spaces and loss functions beyond binary classification. This encompasses classic settings such as multi-class classification, regression, and more. Formally, the setup remains mostly the same. Let X be the instance space, and D a family of distributions over X. Instead of just working over {0, 1}, we will now consider classifiers over a generic label space Y, and a loss function ℓ : Y × Y → R_{≥0}. For simplicity, we will always assume that ℓ(y, y) = 0, and will say that ℓ satisfies the identity of indiscernibles if ℓ(y_1, y_2) = 0 iff y_1 = y_2. Given a family of classifiers H = {h : X → Y} and a distribution D over X × Y, the error or risk of a hypothesis h ∈ H is its expected loss:

err_D(h) = E_{(x,y)∼D}[ℓ(h(x), y)].

Learnability of a class (D, X, H, ℓ) in the distribution-family model is then defined exactly as before, where the only difference lies in replacing classification error with the above.
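For concreteness, the empirical analogue of this generalized risk is a one-liner; the particular losses and toy data below are hypothetical stand-ins, not examples from the paper.

```python
def empirical_risk(h, sample, loss):
    """Empirical analogue of err_D(h) = E_{(x,y)~D}[loss(h(x), y)]."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

squared = lambda a, b: (a - b) ** 2      # regression-style loss
zero_one = lambda a, b: float(a != b)    # binary/multi-class loss

# Toy sample: inputs with slightly noisy targets around y = 2x.
sample = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
model = lambda x: 2.0 * x
risk_sq = empirical_risk(model, sample, squared)
risk_01 = empirical_risk(model, sample, zero_one)
```

The same model looks near-perfect under squared loss but terrible under 0/1 loss, which is exactly why the choice of ℓ matters in the definitions above.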
Unfortunately, it is fairly clear that a naive application of Algorithm 1 will fail in this general model. If Y is infinite, the unlabeled sample S_U may have infinitely many possible labelings, causing C(S_U) to be infinite in turn. Since we are no longer in a regime where uniform convergence is equivalent to learnability, empirical risk minimization is no longer guaranteed to work, and Algorithm 1 may therefore fail. In fact, this is not just an issue with our algorithm: it is an inherent barrier. Realizable and agnostic learning simply aren't equivalent for most reasonable losses over infinite label classes.

Proposition 3.1 (Proposition 8.1). Let ℓ be any loss function over R satisfying the identity of indiscernibles that is continuous in the first variable. Then there exists a class (D, X, H, ℓ) which is realizably learnable but not agnostically learnable.
It is worth mentioning that D can be taken to be the set of all distributions over X in this construction, so the lower bound holds in the distribution-free setting as well.
The construction in Proposition 3.1 uses the infinite label space to exactly encode each hypothesis (i.e. a single labeled example always uniquely determines the corresponding hypothesis). In the realizable case this is clearly learnable in a single sample, but a small amount of noise can completely erase this information so the class can't be agnostically learned. On the positive side, we use a simple modification of Algorithm 1 to show this is essentially the only barrier to agnostic learnability. Somewhat more formally, we call a class discretely learnable if for every ε > 0, there exists an ε-discretization 8 of (D, X, H, ℓ) that is learnable up to O(ε) error. Discrete learnability can informally be thought of as a very weak type of noise tolerance that essentially acts only to rule out the above construction.
We prove that discrete learnability is equivalent to agnostic learnability over two broad classes of loss functions. The first is a basic generalization of loss functions over finite label classes we call doubly bounded loss.
Theorem 3.2 (Informal Theorem A.2: Agnostic → Realizable (Doubly Bounded Loss)). Let ℓ : Y × Y → R_{≥0} be a loss function such that for all y_1 ≠ y_2 ∈ Y:

α ≤ ℓ(y_1, y_2) ≤ β

for some β > α > 0. Then for any class (D, X, H, ℓ), the following are equivalent:

1. (X, H, D, ℓ) is (properly) discretely-learnable.

8. A discretization is a class H′ over a finite (or countable) label space such that every h ∈ H is close to some h′ ∈ H′. See Section 8.1 for details.

2. (X, H, D, ℓ) is (properly) agnostically-learnable.
This result also implies that realizable and agnostic learning are equivalent over any loss function satisfying the identity of indiscernibles over finite Y , since realizable and discrete learnability are equivalent in this case and every such loss function is doubly bounded. We complement this with a lower bound showing a separation between realizable and agnostic learning for general loss over finite Y by exploiting a simple ternary loss function that fails the identity of indiscernibles (see Proposition 7.5).
While many reasonable loss functions on infinite label classes (e.g. ℓ_p-loss) aren't bounded away from 0, they do tend to come with other structure we can utilize. We'll prove a similar result to the above under the weak assumption that our loss satisfies an approximate triangle inequality. Such loss functions, which we call approximate pseudometrics, can informally be thought of as generalizing any sort of distance-based loss.

Theorem 3.3 (Informal: Agnostic → Realizable (Approximate Pseudometrics)). Let ℓ be a c-approximate pseudometric. Then a class (D, X, H, ℓ) is discretely learnable if and only if it is c-agnostically learnable, i.e. learnable up to error c · OPT + ε.
While c-agnostic learnability is a weaker guarantee than we get for doubly bounded loss, it is actually necessary for approximate pseudometrics. In particular, there exist simple discretely-learnable classes over c-approximate pseudometrics which are not c′-agnostically learnable for any c′ < c (see Proposition 8.6). It is also worth noting that the sample complexity blowup in both Theorem 3.2 and Theorem 3.3 remains polynomial in ε^{-1} (indeed nearly quadratic) in most reasonable scenarios. Finally, we remark that along with being novel in the distribution-family setting, to our knowledge these results are actually new to the distribution-free setting as well, where such an equivalence was only known for bounded Lipschitz [5,22] or binary-valued [23,7] loss functions.

Beyond the PAC Setting
While allowing distributional assumptions through the distribution-family model is a good step towards practice, recent trends have started branching even further away from the PAC setting. In this section, we'll discuss a prototypical example of applying Algorithm 1 to an extended model: adversarial robustness. In the appendix, we cover similar extensions to Partial Learning (Appendix C), SQ-learning (Appendix E), and Fair Learning (Appendix F).
Robust learning is an extension of the PAC model introduced to handle adversarial perturbations at test time. Practically, this is meant to ensure that an integrated prediction system (e.g. in a self-driving car) cannot be tricked by possibly imperceptible adversarial changes to the outside world. This can be formalized by modifying the way we compute error. Given a perturbation function mapping X to its power set, U : X → P(X) (think of U as specifying a neighbor set or set of possible corruptions for each x ∈ X), the robust risk of a labeling c : X → Y with respect to a distribution D over X × Y is:

err^U_D(c) = Pr_{(x,y)∼D}[∃x′ ∈ U(x) : c(x′) ≠ y].

Realizable and agnostic robust learning are then defined analogously to the PAC model, where the standard error is replaced with robust error (though the distribution-family model does require a slight twist, see Appendix B for details). We show that a basic modification to Algorithm 1 again implies the two models are equivalent (Theorem 3.4).
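When the perturbation sets U(x) are finite, the robust risk can be evaluated empirically by direct enumeration. The toy Python sketch below (a threshold classifier over integers with radius-1 perturbations, all our own choices) shows how a classifier with zero standard error can still suffer large robust error.

```python
def robust_err(c, sample, U):
    """Empirical robust risk: fraction of pairs (x, y) for which some
    perturbation x' in U(x) makes the classifier c disagree with y."""
    return sum(any(c(xp) != y for xp in U(x)) for x, y in sample) / len(sample)

# Toy setup (our own illustration): integer inputs, threshold classifier,
# perturbation set U(x) = {x - 1, x, x + 1}.
c = lambda x: int(x >= 5)
U = lambda x: {x - 1, x, x + 1}
sample = [(2, 0), (4, 0), (5, 1), (9, 1)]
standard = sum(c(x) != y for x, y in sample) / len(sample)
robust = robust_err(c, sample, U)
```

Here the classifier is correct on every point (standard error 0), but the two points adjacent to the decision boundary can be perturbed across it, giving robust error 1/2.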
We note that this result can also be combined with our analysis for more general loss functions, albeit with a slightly worse c-agnostic parameter. In the classification setting, Theorem 3.4 generalizes recent work giving such an equivalence in the distribution-free model [8,24], though the sample complexity of our algorithm suffers an extra factor of ε −1 in this special case.

Beyond Agnostic Learning
So far we have only considered using Algorithm 1 to reduce from agnostic to realizable learning, albeit in a number of extended settings beyond the PAC model. On the other hand, we claimed in the introduction that Algorithm 1 can be used to build a learner satisfying any "finitely-satisfiable" property. In this section, we'll discuss two such examples: privacy, and malicious noise. In Appendix D we cover a similar application to uniform stability. Note that since we are only modifying the learning property in this section, the base (realizable) learner remains the same and does not require any additional constraints.
We'll start with Kearns and Li's malicious noise [17]. In this model, the learner has access to a faulty sample oracle O_M(·) which returns a labeled sample from the adversary's true distribution with probability 1 − η, and otherwise returns an adversarially chosen pair (x, y). A realizable or agnostic learner is said to be tolerant to malicious noise if it achieves the standard PAC guarantees while drawing from the malicious oracle instead of the standard sample oracle. Like agnostic learning, tolerance to malicious noise is easy to achieve on finite hypothesis classes. As a result, a basic modification of Algorithm 1 gives a blackbox reduction from agnostic learning with malicious noise to realizable learning. This extends the original result of [17] to the distribution-family setting, and is tight in the sense that ε/(1 + ε) is the best possible error tolerance in the malicious model [17]. As before, the result can be combined with our discretization techniques to give a similar equivalence for approximate pseudometric loss that is new in both the distribution-family and distribution-free settings.
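A minimal sketch of the oracle O_M, with a hypothetical adversary that always plants a single mislabeled point; by construction, roughly an η fraction of the resulting sample is corrupted. The target function and all parameter choices are our own toy setup.

```python
import random

random.seed(2)

def malicious_oracle(true_draw, adversary, eta):
    """O_M: with probability 1 - eta return a pair from the true
    distribution; otherwise return an adversarially chosen pair."""
    def draw():
        return adversary() if random.random() < eta else true_draw()
    return draw

target = lambda x: int(x >= 0.5)

def true_draw():
    x = random.random()
    return (x, target(x))

# Adversary repeatedly plants the mislabeled point (0.9, 0).
O_M = malicious_oracle(true_draw, lambda: (0.9, 0), eta=0.1)
S = [O_M() for _ in range(5000)]
corrupted = sum(y != target(x) for x, y in S) / len(S)
```

Since the clean distribution here is noiseless, the fraction of mislabeled points in S concentrates around η = 0.1, the oracle's corruption rate.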
While malicious noise is certainly distinct from the agnostic model, both are examples of noise-tolerance properties. We'll finish this informal discussion of our results with a learning property of a different flavor: privacy. Informally, an algorithm is said to be α-differentially private if its output is not susceptible to small changes in the underlying sample (see Section 8.3 for exact definitions). Privacy is a very strong condition: even relaxed notions such as (α, δ)-differential privacy (which essentially allows for a δ probability of privacy failure) require finite Littlestone dimension in the distribution-free setting [14], which rules out any sort of direct reduction from realizable learning.
On the other hand, our reduction actually is able to recover a different relaxation known as semi-private learning [19]. In this generalization of the well-studied 'label privacy' [25] setting, the learning algorithm has access not only to a private database of labeled samples, but also to a smaller public database of unlabeled data. This models a common practical scenario: a small portion of "opt-in" users are willing to release their participation, but still wish to hide any sensitive label data. By replacing the empirical risk minimization (ERM) process in Step 3 of Algorithm 1 with a popular private algorithm known as the exponential mechanism [26], we give a direct reduction from semi-private to realizable learning in the distribution-family setting. This generalizes Beimel, Nissim, and Stemmer's [19] original equivalence of realizable and semi-private learning in the distribution-free setting to the distribution-family model, and also extends to the general loss functions covered in Section 3.1. In fact, in this case our algorithm actually gives a quantitative improvement over previous work [19,27] in the distribution-free case. For fixed d and δ, Corollary 3.7 resolves the unlabeled sample complexity of semi-private learning, improving over the recent best of Alon, Bassily, and Moran [27], who additionally showed that any class which is not privately learnable requires at least Ω(1/ε) public samples to (semi-)privately learn. The private sample complexity, on the other hand, remains off by a logarithmic factor from known lower bounds [28]. Resolving the latter requires improving the standard application of the exponential mechanism, and remains an interesting open problem.
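To illustrate the swap in Step 3, here is a standard implementation sketch of the exponential mechanism over a finite candidate set, scored by negative empirical error (which has sensitivity 1 with respect to changing one labeled example). The threshold class and all parameters are our own toy choices, not those of the paper.

```python
import math
import random

random.seed(3)

def exponential_mechanism(candidates, score, alpha, sensitivity=1.0):
    """McSherry-Talwar exponential mechanism: sample h with probability
    proportional to exp(alpha * score(h) / (2 * sensitivity))."""
    weights = [math.exp(alpha * score(h) / (2 * sensitivity)) for h in candidates]
    r = random.random() * sum(weights)
    for h, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return h
    return candidates[-1]

# Hypothetical finite class: thresholds on a grid; private labeled sample
# generated by the threshold 0.5 with no noise.
thresholds = [i / 10 for i in range(1, 10)]
S_L = [(x, int(x >= 0.5)) for x in (random.random() for _ in range(200))]
score = lambda t: -sum(int(x >= t) != y for x, y in S_L)  # minus #errors
picks = [exponential_mechanism(thresholds, score, alpha=1.0) for _ in range(300)]
frac_best = sum(p == 0.5 for p in picks) / len(picks)
```

Because the score gap between the true threshold and its neighbors is large here, the mechanism selects the best candidate almost every time while still randomizing enough to protect any single labeled example.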
It is also worth noting that Theorem 3.6 and Corollary 3.7 are robust to some amount of shift in distribution between the public and private databases. This problem, often called covariate shift in other contexts, is a commonly observed issue in machine learning practice, and is especially of concern in privacy where a distribution over "opt-in" public users could easily differ from the overall distribution of private data. We discuss covariate shift in the semi-private setting in more depth in Section 8.4.

Proof Overview: Modification Archetypes
We've already seen how Algorithm 1 works in the basic case of binary classification, but covering the extended regimes above requires some modifications. In this section, we'll overview the four generic types of modification we use to extend Algorithm 1 across the aforementioned settings.
Discretization. We'll start by discussing our main technique to extend Algorithm 1 to infinite label spaces. The basic idea is simple: since we cannot afford to run our learner over all possible labelings of S U , we instead run the learner over labelings coming from some discretization of the class. As long as we have access to a learner for the discretization, we can then use the same arguments covered in Section 2 to prove various occurrences of property generalization. We formalize these notions more generally in Section 8.1, where we use the technique to prove Theorem 3.3 (Theorem 3.2 is proved in Appendix A). Discretization can also be used to handle learning models such as the statistical query setting which output real-valued query responses (see Appendix E).
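As a cartoon of discretization, one can compose a real-valued hypothesis with rounding to an ε-grid, so that only countably many labels survive. The function below is our own sketch, not the paper's formal construction from Section 8.1.

```python
def discretize(h, eps):
    """Round a real-valued hypothesis to an eps-grid of labels, giving a
    hypothesis over a countable label space (an eps-discretization sketch)."""
    return lambda x: round(h(x) / eps) * eps

h = lambda x: x ** 2
h_eps = discretize(h, 0.1)
# The discretized hypothesis stays within eps/2 of the original everywhere.
gaps = [abs(h_eps(x / 10) - h(x / 10)) for x in range(11)]
```

Running the learner over labelings drawn from such a grid keeps C(S_U) finite while moving each hypothesis by at most ε/2 in loss, which is all the arguments of Section 2 need.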
Subsampling. Another core limitation of Algorithm 1 is its reliance on clean unlabeled data. Algorithm 1 works by running a realizable learner over a representative set of unlabeled data, but, in practice, such data may often be corrupted, and data-dependent assumptions such as margin might mean that the optimal hypothesis isn't even well-defined on this set. We handle cases like these by a simple subsampling procedure: instead of running our realizable learner over labelings of S_U, we run the learner over all labelings of all subsets of S_U. As long as S_U contains some amount of uncorrupted data, this subsampling procedure will find it and we can maintain the guarantees discussed in Section 2. We use this technique to prove property generalization for models such as robust learning (see Theorem 3.4 and Appendix B), partial learning (see Appendix C), and malicious noise (see Theorem 3.5 and Section 8.2).
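The subsampling archetype can be sketched as follows: rather than labelings of S_U alone, we enumerate labelings of every nonempty subset of S_U, so corrupted points can simply be left out. The consistent threshold learner below is a hypothetical stand-in for the realizable learner, and the three-point sample is our own toy data.

```python
import itertools

def subsampled_candidates(S_U, realizable_learner, labels=(0, 1)):
    """Run the realizable learner on every labeling of every nonempty
    subset of S_U, collecting all returned hypotheses."""
    C = set()
    n = len(S_U)
    for mask in range(1, 2 ** n):
        subset = [S_U[i] for i in range(n) if mask >> i & 1]
        for ys in itertools.product(labels, repeat=len(subset)):
            h = realizable_learner(subset, ys)
            if h is not None:
                C.add(h)
    return C

def threshold_learner(xs, ys):
    # Consistent learner for thresholds h_t(x) = 1[x >= t] (sketch).
    for t in [0.0] + sorted(xs) + [2.0]:
        if all(int(x >= t) == y for x, y in zip(xs, ys)):
            return t
    return None

C = subsampled_candidates([0.2, 0.6, 0.9], threshold_learner)
```

The blowup is exponential in |S_U| either way (3^n versus 2^n labelings for binary labels), so subsampling costs nothing asymptotically while guaranteeing that some uncorrupted subset is always tried.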
Replacing the Finite Learner. In the introduction, we proposed a general paradigm (guiding principle) called property generalization: that a variant of any learning property which holds for finite classes should in fact hold for any "learnable" class in the base model. The main idea relies on replacing Step 3 of Algorithm 1 (which, as stated, is an empirical risk minimization process) with a generic learner for finite classes with the desired property. For noise-tolerance properties such as agnostic and malicious noise, empirical risk minimization works. Properties such as privacy or stability, however, require a different finite learner. To prove Theorem 3.6, for example, we replace the ERM process in Algorithm 1 with the exponential mechanism [26]. We use a similar strategy in Appendix D to prove an analogous result for uniform stability.
Replacing the Base Learner. Finally we note a very basic modification of Algorithm 1 that allows us to extend property generalization beyond the PAC setting: simply replace the input realizable PAC learner with a realizable learner in the desired model. This is usually combined with one of the techniques above depending on the specific application, e.g. to prove property generalization for robust learning and the statistical query model. The same idea can also be used to analyze semi-private learning with covariate shift (see Section 8.4) and property generalization for fair learning (see Appendix F).

Related Work
Agnostic learning is a very widely studied model across learning theory, and works across many different sub-areas have noted model-specific equivalences with realizable learning. Here we'll survey a few representative examples, and discuss how they relate to, and differ from, our approach.

Beyond Binary Classification
Uniform Convergence and Multiclass Classification. It is well known that the uniform convergence equals learnability paradigm continues to hold for 0/1-valued loss functions over constant-size label spaces [3,23,29,30], and that agnostic and realizable learning are equivalent as a result. On the other hand, Daniely, Sabato, Ben-David, and Shalev-Shwartz [31] showed this is no longer the case as the number of labels grows large. In this regime, even basic multi-class learning is no longer equivalent to uniform convergence, so the connection between realizable and agnostic learning becomes non-trivial. A few years later, David, Moran, and Yehudayoff (DMY) [7] proved the equivalence nevertheless holds in the infinite multi-class setting through the weaker sample compression equals learnability paradigm. While more general than the uniform convergence paradigm, their proof remains model-specific and fails in many of the settings we consider, e.g. partial learning [13].
Discretization and General Loss Functions. Basic forms of discretization were also considered back in the mid-90s in work on characterizing the learnability of real-valued functions. In a seminal work, Bartlett, Long, and Williamson (BLW) [5] proved that a scale-sensitive measure introduced by Kearns and Schapire [32] called the fat-shattering dimension characterizes learnability under bounded Lipschitz loss functions.12 BLW use a basic form of discrete learning (called quantization) to prove that fat-shattering dimension is a necessary condition, and use uniform convergence to prove sufficiency. We give a similar argument to BLW's in the necessary direction, but show that uniform convergence is not needed for the equivalence to hold, and instead use Algorithm 1 to appeal directly to discrete learnability. This allows us to extend BLW's result across a much more general set of loss functions and scenarios without strong model-specific assumptions.

Semi-supervised, Active, and Semi-Private Learning
Our reduction hinges on combining a realizable learner with unlabeled data to cut down the number of potential hypotheses in our class. The use of unlabeled samples to this effect is one of the core ideas in the field of semi-supervised learning [33,34]. There, it is usually additionally assumed that the function to be learned has some relation (or 'compatibility') to the underlying data distribution: for example, it might have large margin on unlabeled data as in Transductive SVM [35], or redundant sufficient information as in Co-training [36,37]. In their seminal work on the topic, Balcan and Blum [33] employed a strategy similar to Algorithm 1 in which they draw an unlabeled sample S_U and select hypotheses consistent with each possible labeling based upon compatibility. They argue via uniform convergence that this results in a uniform cover, and then use empirical risk minimization to select a good hypothesis in the cover. It is worth noting that around the same time a similar strategy independently found use in the online learning literature in work of Ben-David, Pal, and Shalev-Shwartz [9], who simulated the so-called 'standard optimal algorithm' (SOA) over a sequence of examples and applied weighted majority [38] over the resulting set of hypotheses to obtain an agnostic online learner.

(Footnote 12: Though the original work only considers ℓ1 loss, their techniques generalize to Lipschitz losses; see for instance [22].)
Similar strategies have also found use in the related active learning literature. Hanneke and Yang [39] use the same technique to build a cover from unlabeled samples (adding one hypothesis consistent with each possible labeling), and then apply active (adaptive) query algorithms to learn the best hypothesis in the cover in as few labeled samples as possible. This generalized earlier work of Dasgupta [40], who assumed a priori that the cover was known to the learner ahead of time. Most recently, the approach has seen use in the study of semi-private learning. In their original work on the model, Beimel, Nissim, and Stemmer [19] again apply the same trick for building a uniform cover, but then find the best hypothesis privately via the exponential mechanism (similar to our proof of Theorem 3.6). The analysis of this strategy was later improved by Alon, Bassily, and Moran (ABM) [27].
The above works differ from ours in two crucial senses. First, each work focuses solely on developing an algorithm for its specific framework (rather than working to understand a more general equivalence or reduction between settings). In this sense, one can view each of these prior results as a specific instance of our general framework in which the "base learner" in our reduction is restricted to be an empirical risk minimizer (or the SOA in the online setting), and a problem-specific learner for the relevant property (online, agnostic, active, or private) is then applied over the resulting cover. Second, and perhaps most importantly, these previous works all rely fundamentally on uniform convergence. This means that their algorithms break down as soon as one moves away from the original PAC model (even to, say, the basic distribution-dependent setting), and can also lead to sub-optimal sample complexity bounds. In the analysis of semi-private learning, for instance, we show that avoiding uniform convergence leads to asymptotically better bounds, actually resolving the public sample complexity of the model altogether. Indeed, one can show that building a uniform cover requires asymptotically more unlabeled samples than a non-uniform one, and therefore cannot result in optimal semi-supervised algorithms.

Non-Uniform Covering and Probabilistic Representations
Covering techniques have long been used in learning theory, and while almost all prior works focus on uniform notions (where all hypotheses are covered simultaneously), there is one notable exception. In 2013, Beimel, Nissim, and Stemmer (BNS) [41] introduced probabilistic representations, a strong randomized form of covering used to characterize pure differentially private learning. In the language of our work, given a class (X, H), a probabilistic representation is a distribution over subsets of H which is a non-uniform cover simultaneously for all distributions over X. BNS prove that private learning is equivalent to the existence of a probabilistic representation for the class. Equivalently, this can be thought of as the ability to build a non-uniform cover without access to the underlying distribution at all. We, on the other hand, are interested in the much weaker setting where a non-uniform cover can be built from a bounded number of samples from the distribution (and crucially argue that this is equivalent to realizable learning). Thus in a sense, our core connection between realizable learning and non-uniform covering can be thought of as an analog of BNS' characterization of private learning by probabilistic representations.

Paper Organization
The remainder of this paper is split into two portions, the main body and the appendix. The main body covers our base reduction for finite label classes along with four archetypes of modification, and is meant to be read as written. On the other hand, the Appendix covers various applications of these archetypes to a variety of learning models and properties. These sections are all self-contained, and are meant more as a reference text in the sense that the reader interested in some particular model or property should simply skip to the section covering that application.
The main body is organized as follows: we cover preliminary definitions in Section 6, our base reduction from agnostic to realizable learning for finite label classes in Section 7, and discuss the four modification archetypes along with a representative application in Sections 8.1, 8.2, 8.3 and 8.4. In more detail, these sections respectively cover: extensions to infinite label classes via discretization, malicious noise via sub-sampling, agnostic semi-private learning via replacing ERM, and covariate shift via replacing the base learner.
In the Appendix we cover applications to doubly-bounded loss (Appendix A), robust learning (Appendix B), partial learning (Appendix C), uniformly-stable learning (Appendix D), the statistical query model (Appendix E), and fair learning (Appendix F), and discuss further connections of non-uniform covers to previous notions of covering (Appendix G).

Preliminaries
Before moving to a more formal discussion of our results, we'll cover the most basic learning models discussed in this work: standard (distribution-free) PAC-learning and distribution-family PAC-learning. Extended models we consider beyond these (e.g. malicious noise, robust learning, partial learning, etc.) will instead be introduced in their respective sections.

PAC-Learning
We'll start by reviewing the seminal PAC-learning model of Valiant [1] and Vapnik and Chervonenkis [2], beginning with a few core definitions for the setting of general loss. Let X be an arbitrary set called the instance space (e.g. R^d), Y a set called the label space (e.g. {0, 1}), and H a family of labelings of X by Y (that is, a family of functions of the form h : X → Y). Given a class (X, H), it will often be useful to consider its growth function Π_H(n), which measures the maximum size of H when restricted to a sample of size n:

Π_H(n) := max_{x_1,…,x_n ∈ X} |{(h(x_1), …, h(x_n)) : h ∈ H}|.

We note that the growth function is trivially bounded by |Y|^n, but one can often give stronger bounds when (X, H) satisfies some finite combinatorial dimension (e.g. VC-dimension in the binary case). While PAC-learning is sometimes used to refer only to classification, we will study the model under general loss functions. With that in mind, we call a function ℓ : Y × Y → R_{≥0} a loss function if ℓ(y, y) = 0 for all y ∈ Y. We say a loss ℓ satisfies the identity of indiscernibles if ℓ(y_1, y_2) = 0 iff y_1 = y_2. Given any distribution D over X × Y and loss ℓ, the risk of a labeling h : X → Y with respect to D and ℓ is its expected loss:

err_{D,ℓ}(h) := E_{(x,y)∼D}[ℓ(h(x), y)].

The goal of learning is generally to find a classifier h ∈ H that minimizes risk. More formally, there are two commonly studied variants of this problem. The original formulation, now called realizable learning, assumes the existence of a hypothesis in H with no loss.

Definition 6.1 ((Realizable) PAC-learning). We say (X, H, ℓ) is realizable PAC-learnable if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and all distributions D over X × Y for which some h* ∈ H has err_{D,ℓ}(h*) = 0:

Pr_{S∼D^{n(ε,δ)}}[err_{D,ℓ}(A(S)) ≤ ε] ≥ 1 − δ.
A is called proper if it only outputs labelings in H.
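As a concrete illustration of the growth function, the following sketch computes Π_H(n) by brute force for a toy class of integer thresholds (the class and helper are our own hypothetical example, chosen only to make the definition executable):

```python
from itertools import combinations

def growth_function(domain, hypotheses, n):
    """Pi_H(n): the maximum number of distinct behaviours of H on any
    n points of the domain, computed by brute-force enumeration."""
    best = 0
    for pts in combinations(domain, n):
        patterns = {tuple(h(x) for x in pts) for h in hypotheses}
        best = max(best, len(patterns))
    return best

# Toy class: thresholds h_t(x) = 1[x >= t] over a small integer domain.
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]
```

For thresholds one finds Π_H(n) = n + 1, comfortably below the trivial |Y|^n = 2^n bound, reflecting that the class has VC-dimension 1.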
Perhaps a more realistic variant of PAC-learning is to drop this restriction on the adversary and let them choose an arbitrary distribution over X × Y. This model, introduced by Haussler [4] and Kearns, Schapire, and Sellie [17], is known as agnostic learning.

Definition 6.2 ((Agnostic) PAC-learning). We say (X, H, ℓ) is agnostic PAC-learnable if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y:

Pr_{S∼D^{n(ε,δ)}}[err_{D,ℓ}(A(S)) ≤ OPT + ε] ≥ 1 − δ,

where OPT := min_{h∈H} err_{D,ℓ}(h).

For some settings covered in this work, it will turn out that reaching OPT + ε error is too stringent a condition. However, we will show in these cases that it is sometimes possible to maintain a weaker guarantee and learn up to c · OPT + ε error for some constant c > 1. We call such classes c-agnostically learnable.
Finally, we note that for simplicity, when ℓ is the standard classification error ℓ_{0/1}(y_1, y_2) := 1[y_1 ≠ y_2], we'll simply write (X, H) to mean (X, H, ℓ). Realizable and agnostic learning are well studied under many basic loss functions including binary classification, where both models are known to be characterized by a combinatorial parameter called the VC-dimension.

Learning Under Distribution Families
The standard PAC models described above are often called distribution-free because no assumptions are made on the marginal distribution over X. In practice, however, this is usually too worst-case an assumption: we often expect distributions in nature to be "nice" in some way, or at least somewhat restricted. This is reflected in the fact that popular machine learning algorithms usually significantly outperform the PAC model's worst-case generalization bounds. Indeed, such niceness assumptions have long been popular in learning theory as well, where conditions such as tail bounds or anti-concentration are frequently used to build efficient algorithms. These ideas are captured more generally by a simple (but notoriously difficult) extension to the PAC framework originally proposed by Benedek and Itai [6], where the adversary is restricted to picking from a fixed, known set of distributions.

Definition 6.3 ((Realizable) Distribution-Family PAC-learning). Let X be an instance space and D a family of distributions over X. We say (D, X, H, ℓ) is realizable PAC-learnable if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying:

1. The marginal D_X ∈ D,
2. Some h* ∈ H has err_{D,ℓ}(h*) = 0,

A outputs a good hypothesis with high probability:

Pr_{S∼D^{n(ε,δ)}}[err_{D,ℓ}(A(S)) ≤ ε] ≥ 1 − δ.

Agnostic learning is defined similarly. The adversary must still choose a marginal distribution in D, but the conditional labeling can be arbitrary.

Definition 6.4 (Agnostic Distribution-Family PAC-learning). We say (D, X, H, ℓ) is agnostic PAC-learnable if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying:

1. The marginal D_X ∈ D,

A outputs a good hypothesis with high probability:

Pr_{S∼D^{n(ε,δ)}}[err_{D,ℓ}(A(S)) ≤ OPT + ε] ≥ 1 − δ,

where OPT := min_{h∈H} err_{D,ℓ}(h). The weaker c-agnostic learning is defined analogously with OPT replaced by c · OPT. Unlike the standard model, very little is known about distribution-family learnability. While a number of works have made some progress on this front [6,20,42,43], a characterization of learnability remains elusive despite some 30 years of effort.

The Core Reduction: Agnostic to Realizable Learning
In this section, we give a more detailed exposition of our main reduction as covered in Section 2, including the more general setting of arbitrary loss on constant size label spaces (in the distribution-family model), matching lower bounds, and additional discussion of non-uniform covers. As mentioned previously, since there is no known combinatorial characterization of learnability in the distribution-family model, standard techniques [3,23,29,30] cannot be used, and it is plausible that no combinatorial characterization of learnability exists for this model at all [21].
Before jumping into our reduction proper, it is worth discussing why we can't simply take the approach of prior works and rely on uniform convergence, a strong condition which promises that on a large enough sample, the empirical error of every hypothesis will be close to its true error. While uniform convergence was a very popular technique in the early years of learning theory, practitioners have since moved away from the paradigm, which fails to capture learning rates seen in practice [44,45]. Indeed, it soon became clear that the technique failed to capture even basic theoretical models such as the distribution-dependent setting.

Proposition 7.1 (Benedek and Itai [6]). There exists a PAC-learnable class (D, X, H) over binary labels and classification loss without the uniform convergence property.
Proof. Let X = [0, 1], D be the uniform distribution over X, Y = {0, 1}, and H consist of all indicator functions for finite sets S ⊂ X, as well as for X itself. It is not hard to see that (D, X, H) is realizably PAC-learnable from only a single sample by the following scheme: if the learner draws a sample labeled 1, output the all-1's function; otherwise, output the all-0's function. When the adversary has chosen a finite set, with probability 1 the learner draws a sample labeled 0, and outputs a hypothesis with 0 error (since the finite set has measure 0). If the adversary chooses the all-1's function, the learner will always output the all-1's function.
On the other hand, it is clear that when the adversary chooses the all-1's function, no matter how many samples the learner draws, there will exist a hypothesis in the class that is poorly approximated by the sample: namely, the hypothesis whose support is given by the support of the sample itself has empirical measure 1, but true measure 0. As a result, this class fails to have the uniform convergence property despite its learnability.
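The single-sample learner from this proof is simple enough to write down directly; the following Python sketch (our own illustration) makes the case analysis explicit:

```python
import random

def one_sample_learner(labeled_point):
    """Learner from Proposition 7.1: a single labeled example suffices."""
    x, y = labeled_point
    if y == 1:
        # A 1-labeled draw has probability 0 under any finite target set,
        # so the adversary must have chosen the all-1's function.
        return lambda _: 1
    # Otherwise output the all-0's function: a finite target set has
    # measure zero, so this hypothesis has zero true error.
    return lambda _: 0

rng = random.Random(0)
target_set = {0.25, 0.5}              # adversary's finite set
target = lambda x: int(x in target_set)
x = rng.random()                      # a uniform draw misses target_set a.s.
h = one_sample_learner((x, target(x)))
```

Here the learner returns the all-0's function (a uniform draw lands in the finite target set with probability 0), which has true error zero even though it disagrees with the target on its support.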
In later sections, we will even see distribution-free models where uniform convergence fails, such as the Partial PAC model [12,13] which captures realistic scenarios such as learning with margin. Since even the most basic modifications of PAC-learning fail to satisfy uniform convergence, it is clear we need to move beyond the condition to gain a more general understanding of the common phenomenon of equivalence between learning models.
Instead of relying on uniform convergence, our core observation is an equivalence between learning and sample access to a combinatorial object we call a non-uniform cover.
Definition 7.2 (Non-uniform Cover). Let (X, H) be a class over label space Y, let L_{X,Y} denote the family of all labelings from X to Y, and let d : L_{X,Y} × L_{X,Y} → R_{≥0} be a measure of distance between labelings. If C is a random variable over the power set P(L_{X,Y}), we call C an (ε, δ)-non-uniform cover of H with respect to d if for every h ∈ H:

Pr_{T∼C}[∃h′ ∈ T : d(h, h′) ≤ ε] ≥ 1 − δ.

We call C bounded if its support lies entirely on subsets of size at most some k ∈ N, and we call the smallest such k its size.
Non-uniform covers share a close connection to several notions of covering used throughout the learning literature such as uniform covers [27] and fractional covers [46]. We discuss these connections in more detail in Appendix G. For the moment, we note only that previous works using the strictly stronger notion of uniform covering necessarily lose factors in the sample complexity as a result. We discuss this further in Section 8.3 as well.
In Section 2, we argued (at least implicitly) that once we have sampling access to a bounded non-uniform cover, agnostic learnability follows from standard arguments: since a sample T has bounded size and is guaranteed to contain a concept "close" to optimal, it suffices to run empirical risk minimization over roughly log(|T|/δ)/ε² samples. The key to our reduction therefore boils down to turning blackbox access to a realizable PAC-learner into sampling access to some relevant non-uniform cover. This is given by Step 2 of Algorithm 1, which we rewrite here as a subroutine called LEARNINGTOCOVER.

Algorithm 2: LEARNINGTOCOVER
Input: Realizable PAC-Learner A, Unlabeled Sample Oracle O_U
Algorithm:
1. Draw an unlabeled sample S_U ∼ O_U.
2. For every labeling h(S_U) of S_U realizable by some h ∈ H, run A on (S_U, h(S_U)).
Return the set of responses C(S_U) := {A(S_U, h(S_U)) : h ∈ H}.

In fact, we already argued in Section 2 that LEARNINGTOCOVER gives sampling access to a non-uniform cover, but we will restate the result here in this formulation for convenience.
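A minimal executable rendering of LEARNINGTOCOVER plus the ERM step may help fix ideas. For simplicity this sketch enumerates every labeling in Y^{|S_U|} rather than only the Π_H(|S_U|) labelings realizable by H, and uses a toy consistent-threshold learner as the realizable base learner A; all names here are our own illustration:

```python
from itertools import product

def learning_to_cover(realizable_learner, unlabeled_sample, labels):
    """Feed every possible labeling of the unlabeled sample to the
    realizable learner and collect the responses; the returned list
    C(S_U) is one draw from a (bounded) non-uniform cover."""
    cover = []
    for labeling in product(labels, repeat=len(unlabeled_sample)):
        cover.append(realizable_learner(list(zip(unlabeled_sample, labeling))))
    return cover

def erm_over_cover(cover, labeled_sample):
    """Pick the empirically best hypothesis in the cover; a Chernoff +
    union bound shows ~log(|C|/delta)/eps^2 labeled samples suffice."""
    def emp_err(h):
        return sum(h(x) != y for x, y in labeled_sample)
    return min(cover, key=emp_err)

def consistent_threshold_learner(labeled):
    # Toy realizable base learner: the smallest threshold consistent
    # with the 1-labeled points (defaults to the all-0's hypothesis).
    ones = [x for x, y in labeled if y == 1]
    t = min(ones) if ones else float("inf")
    return lambda x, t=t: int(x >= t)
```

Running `learning_to_cover(consistent_threshold_learner, [1, 2, 3], (0, 1))` yields 2³ = 8 hypotheses, and ERM over them recovers a zero-error threshold whenever one exists.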
Proof. The proof is essentially immediate from the definition of realizable PAC-learning: A promises that for any h ∈ H and D ∈ D, a 1 − δ fraction of labeled samples (S_U, h(S_U)) ∼ D^{n(ε,δ)} satisfy

err_{D×h,ℓ}(A(S_U, h(S_U))) ≤ ε.

This means that as long as we have blackbox access to a realizable PAC-learner and unlabeled samples from the adversary's distribution, we can simulate access to a non-uniform cover. Let's now formalize our previous intuition that this is sufficient to turn a realizable learner into an agnostic one for any finite label class. We will generalize this result to doubly-bounded loss in Appendix A, but it is instructive to consider the setting of finite Y first.
Theorem 7.4. Let Y be a finite label space and ℓ a loss satisfying the identity of indiscernibles. If (D, X, H, ℓ) is realizably PAC-learnable, then it is agnostically PAC-learnable.

Proof. Let A be the promised realizable learner for (X, H, D, ℓ) with sample complexity n(ε, δ). Run LEARNINGTOCOVER with parameters ε′ = η_ℓ ε and δ′ = δ/2. We argue that the output contains some h′ such that err_{D,ℓ}(h′) ≤ OPT + ε/2. To see why C(S_U) has this property, recall that for any h ∈ H, Lemma 7.3 states that C(S_U) contains some h′ such that:

E_{x∼D_X}[ℓ(h(x), h′(x))] ≤ ε′.

Because we assume that ℓ(a, b) = 0 iff a = b, this actually implies a stronger relation: h and h′ must be close in classification error:

Pr_{x∼D_X}[h(x) ≠ h′(x)] ≤ ε′ / min_{a≠b} ℓ(a, b).
Let h_OPT ∈ H be an optimal hypothesis, and let h′_OPT denote the corresponding output of LEARNINGTOCOVER. Then by the above, we have that:

err_{D,ℓ}(h′_OPT) ≤ err_{D,ℓ}(h_OPT) + max_{a≠b} ℓ(a, b) · Pr_{x∼D_X}[h_OPT(x) ≠ h′_OPT(x)] ≤ OPT + ε/2,

where we have used the assumption that we set η_ℓ = c · min_{a≠b} ℓ(a, b) / max_{a≠b} ℓ(a, b) for some universal constant c > 0.
It's worth spending a moment discussing our only assumption on the loss function ℓ: that it satisfies the identity of indiscernibles. This is not only a natural assumption for most cases in practice (that mislabeling has non-zero error), it is theoretically justified as well: realizable and agnostic learning aren't necessarily equivalent for ℓ without this property, even in the distribution-free setting.

Proof. Let the instance space X = N be the set of natural numbers and the label space Y = {0, 1}². We consider the hypothesis class H of all functions which output the first bit as 0, that is:

H = {h : X → Y | ∀x ∈ X : h(x)_1 = 0}.

Furthermore, we define the loss function ℓ : Y × Y → {0, 1, c} by

ℓ((a_1, b_1), (a_2, b_2)) = 0 if a_1 = a_2; 1 if a_1 ≠ a_2 and b_1 = b_2; c if a_1 ≠ a_2 and b_1 ≠ b_2.

Note that (X, H, ℓ) is trivially learnable in the realizable setting simply by returning any h ∈ H. On the other hand, we will show it is only O(c)-agnostically learnable. First, notice that for any labeling f : X → Y, there exists a hypothesis h ∈ H which matches f on the second bit, and therefore for any marginal D over X:

OPT = min_{h∈H} err_{D×f,ℓ}(h) ≤ 1.

As a result, it suffices to show that for every m ∈ N and (randomized) algorithm A using m samples there exists a labeling f : X → Y and marginal distribution D_X such that

E_S[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c/12.    (1)

As long as this holds, Markov's inequality gives that every algorithm must have error at least Ω(c) with constant probability. For simplicity, we will restrict our attention in the rest of the proof to the marginal distribution D_X which is uniform over the set [k] for some natural number k we will fix later. To prove Equation (1), by Yao's minimax principle it is enough to prove there is a distribution µ over functions f : [k] → Y such that any deterministic algorithm A has expected loss at least c/12 over µ:

E_{f∼µ,S}[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c/12.

We now show that the above holds for µ being uniform over all functions from [k] to Y for any k > 2m.
Here, we have that

E_{f∼µ,S}[err_{D_X×f,ℓ}(A(S, f(S)))] = (1/k) Σ_{x∈[k]} E_{f,S}[ℓ(A(S, f(S))(x), f(x))] ≥ (c/4) · (1/k) Σ_{x∈[k]} Pr_S[x ∉ S],

where the last step follows from noting that for any value (a, b) that A(S, f(S)) assigns to x ∉ S, f(x) will be (1 − a, 1 − b) with probability 1/4, incurring a loss of c. The result then follows by noting that for every x ∈ [k]:

Pr_S[x ∉ S] ≥ (1 − 1/k)^m ≥ 1 − m/k > 1/2,

since we have assumed k > 2m. Therefore, we get that

E_{f∼µ,S}[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c/8 ≥ c/12,

which completes the proof.
Note that this bound holds even if A is allowed to be improper. It is worth noting that if we are willing to increase the size of Y , the learner's error in this bound can actually be increased all the way to c, the maximum possible (see Proposition 8.1). This is in fact tight, as we will show that any loss function like the above satisfying a c-approximate triangle inequality can be c-agnostically learned (that is learned to within c · OP T + ε error).

Discretization: Infinite Label Classes
In the previous section, we showed that our base reduction characterizes the equivalence of realizable and agnostic learning for loss functions satisfying the identity of indiscernibles over all finite label classes. In this section, we discuss a technique called discretization that allows our reduction to extend this result to infinite label classes. It's clear that when Y is infinite our standard reduction will generally fail: since the total number of possible labelings of a finite sample may be infinite, LEARNINGTOCOVER may output an infinite set. In fact, this is more than a technical barrier: realizable and agnostic learning simply aren't equivalent for infinite label classes. More formally, let D be the family of all distributions. By the continuity of ℓ and the fact that ℓ(0, 0) = ℓ(1, 1) = 0, notice that for all ε > 0 there exists γ = γ(ε) > 0 such that max_{0≤γ′≤γ} {ℓ(γ′, 0), ℓ(1 + γ′, 1)} < ε. Let n_γ ∈ N be the index of the first non-zero digit in the binary representation of γ. The idea is to note that beyond these first n_γ coordinates, our class is within ε of an arbitrary boolean function. More formally, notice that for any distribution D and boolean function f which is 0 on [n_γ], we have that OPT_H(f) := min_{h∈H} {err_{D×f,ℓ}(h)} ≤ ε. The bound then follows from the fact that such arbitrary functions are not learnable.
In more detail, Yao's minimax principle states that it is sufficient to show that for any potential sample complexity m(ε, δ), there exists a randomized strategy for the adversary such that no deterministic learner can achieve OPT + c accuracy with constant probability for some constant c > 0. To this end, consider the following strategy: the adversary chooses the uniform distribution over [n_γ, n_γ + 2m(ε, δ)], and a binary function on that interval uniformly at random (recall that OPT is at most ε for every such function). Since the learner can only see half of the mass, any strategy must be incorrect on half of the remaining points in expectation. In particular, conditioned on any sample, the expected loss of any predicted label y of an unseen point is at least

ℓ_min-err := min_y {(ℓ(y, 0) + ℓ(y, 1))/2}

(since each unseen label appears with probability 1/2 conditioned on the learner's sample). The total expected loss of any strategy is then at least ℓ_min-err/2, which is bounded away from 0. Setting ε and c sufficiently small then gives the desired result. Proposition 8.1 relies crucially on the fact that the adversary can erase a significant amount of information with a very small label perturbation. In the rest of this section, we'll discuss a technique for modifying our reduction that shows this is essentially the only barrier between realizable and agnostic learning (at least for a broad class of loss functions). The key is to require a slightly stronger notion of learnability based upon discretization.
Definition 8.2 (Discretization). We say (D, X, H′, ℓ) is an ε-discretization of (D, X, H, ℓ) if the following three conditions hold:

1. H′ is probably bounded. That is, for all n ∈ N, δ > 0, and D ∈ D there exists a bound m(n, δ) ∈ N such that a sample S ∼ D^n satisfies |H′|_S| ≤ m(n, δ) with probability at least 1 − δ.

2. H′ covers H. That is, for every h ∈ H there exists h′ ∈ H′ such that ℓ(h(x), h′(x)) ≤ ε for all x ∈ X.

3. H′ is always useful. That is, for every h′ ∈ H′ there exists h ∈ H such that ℓ(h′(x), h(x)) ≤ ε for all x ∈ X.

Note that most realistic settings have reasonable discretizations (e.g. it is enough to have some Lipschitz-like condition and a weak tail-bound on the loss). We now define a basic notion of learnability based on discretization which essentially serves to rule out adversarial constructions in the vein of Proposition 8.1.
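For intuition, a real-valued class can often be discretized simply by rounding predictions onto a grid. The sketch below is our own illustration, assuming a [0, 1]-valued class and a loss that is Lipschitz in each argument, and produces an H′ whose predictions are pointwise ε/2-close to those of H:

```python
def discretize(h, eps):
    """Round a [0,1]-valued hypothesis onto an eps-grid.

    The rounded hypothesis h' satisfies |h(x) - h'(x)| <= eps/2 for all
    x, and its range has at most 1/eps + 1 values, so the discretized
    class has at most (1/eps + 1)^n behaviours on any n points (a
    deterministic version of the 'probably bounded' requirement).
    """
    def h_prime(x):
        return round(h(x) / eps) * eps
    return h_prime
```

Under a Lipschitz loss, pointwise closeness of predictions translates into closeness in risk, which is what the covering and "always useful" conditions ask for.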
Approximate pseudometrics are natural choices for loss functions in practice and capture a broad set of scenarios, including finite-range losses and standard setups such as ℓ_p-regression, and have seen some previous study in the literature [47]. By modifying the first step of our reduction to take discretization into account and leveraging the approximate triangle inequality in the second, we prove that discrete learnability and c-agnostic learnability are equivalent under c-approximate pseudometrics.

Proof. The proof is similar to Theorem 7.4. We first show the forward direction. Assume (D, X, H, ℓ) is discretely-learnable. Fix ε′ = ε/(4c²c₁) (where c₁ is the constant from Definition 8.3), and let H_ε′ be a learnable ε′-discretization of H. We argue that running LEARNINGTOCOVER on H_ε′ gives the desired agnostic learner. Since ℓ is bounded, it is sufficient to prove that C(S_U) contains a hypothesis h′ such that err_{D,ℓ}(h′) ≤ c · OPT + ε/2. Empirical risk minimization then works as in the finite case.
Let h_OPT ∈ H be an optimal hypothesis. Since H_ε′ is a discretization of H, there exists h^{ε′}_OPT ∈ H_ε′ such that:

∀x ∈ X : ℓ(h_OPT(x), h^{ε′}_OPT(x)) < ε′.

Further, by the guarantees of discrete learnability, with probability at least 1 − δ/2 there exists h′ ∈ C(S_U) that is close to h^{ε′}_OPT in the sense that:

E_{x∼D_X}[ℓ(h^{ε′}_OPT(x), h′(x))] ≤ ε′.

Plugging in the previous observation and applying our approximate triangle inequality, we get that h′ is close to h_OPT as well. The final step is to transfer from the marginal D_X to the full joint distribution of the adversary, which follows immediately from a similar application of the approximate triangle inequality. This is the only step that loses a factor in the OPT term:

err_{D,ℓ}(h′) ≤ c · OPT + ε/2,

as desired.
We now prove the reverse direction, which is essentially immediate. Assume the existence of a c-agnostic learner for (D, X, H, ℓ). Given a discretization H_ε, we want to show (D, X, H_ε, ℓ) is learnable to within c₁ε error for some c₁ > 0. This is achieved simply by running the agnostic learner for (D, X, H, ℓ). Since H_ε is "always useful," every h ∈ H_ε is ε-close to some h′ ∈ H in the sense that:

∀x ∈ X : ℓ(h(x), h′(x)) ≤ ε.

In particular, this means that for any choice of h by the adversary there exists h′ ∈ H with low error, so OPT ≤ ε. As a result, running the c-agnostic learner for (D, X, H, ℓ) returns a hypothesis with at most (c + 1)ε error with high probability.
It is worth noting that bounded loss is not really necessary for Theorem 8.5. More generally we can require that (D, X, H, ℓ) is "finitely learnable" in the sense that for all finite subsets H ′ ⊂ H, (D, X, H ′ , ℓ) is agnostically learnable. When ℓ is bounded, this is true for any finite class by empirical risk minimization.
It is also worth noting that various modifications to the definition of loss (e.g. defining loss between hypotheses rather than on Y directly) will continue to work with the above. Similarly, there are various cases when one can get better than c · OPT accuracy for a c-approximate pseudometric, generally by instead optimizing over some surrogate loss function. For instance, if a simple transformation of the loss gives a c′-approximate pseudometric for c′ < c, then one can generally learn up to c′ · OPT. As an example, note that while square loss ℓ₂(x, y) = (x − y)² is a 2-approximate pseudometric, taking √(err_{D,ℓ₂}) gives a true metric between hypotheses. As a result, as long as OPT is bounded, we can get truly agnostic learning by optimizing √(err_{D,ℓ₂}) instead. This strategy works for any polynomial loss ℓ_p(x, y) = |x − y|^p. On the other hand, outside of these special cases, Theorem 8.5 is tight: there exist c-approximate pseudometric loss functions which cannot be c′-agnostically learned for any c′ < c. The argument is similar to Proposition 8.1, but requires a bit more care.

Proposition 8.6. There exists a discretely-learnable class over a c-approximate pseudometric that is not c′-agnostically learnable for any c′ < c.
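The square-loss claims above are easy to sanity-check numerically; this short script (our own illustration) verifies on random triples that ℓ₂ satisfies the 2-approximate triangle inequality while its square root satisfies the exact one:

```python
import math
import random

def sq(a, b):
    return (a - b) ** 2

rng = random.Random(0)
triples = [(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(-5, 5))
           for _ in range(1000)]

# ell_2 is a 2-approximate pseudometric: (x-z)^2 <= 2[(x-y)^2 + (y-z)^2].
two_approx = all(sq(x, z) <= 2 * (sq(x, y) + sq(y, z)) for x, y, z in triples)

# Plain squared loss fails the exact triangle inequality...
exact_for_sq = all(sq(x, z) <= sq(x, y) + sq(y, z) for x, y, z in triples)

# ...but its square root is |x - z|, which satisfies it exactly.
exact_for_root = all(
    math.sqrt(sq(x, z)) <= math.sqrt(sq(x, y)) + math.sqrt(sq(y, z)) + 1e-12
    for x, y, z in triples)
```

The 2-approximate bound is just the identity (x − z)² = (x − y)² + (y − z)² + 2(x − y)(y − z) combined with the AM-GM inequality, and the cross-term is also exactly why the unmodified squared loss fails the exact triangle inequality.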
Proof. The proof is similar to Proposition 7.5. We consider the same instance space X = N and the hypothesis class of functions whose first coordinate is always 0:

H = {h : X → Y | ∀x ∈ X : h(x)_1 = 0}.

The loss function ℓ : Y × Y → {0, 1, c} is also the same, but extended to the larger domain Y = N²:

ℓ((a_1, b_1), (a_2, b_2)) = 0 if a_1 = a_2; 1 if a_1 ≠ a_2 and b_1 = b_2; c if a_1 ≠ a_2 and b_1 ≠ b_2.

As before, note that (X, H, ℓ) is trivially realizably learnable by always returning any h ∈ H, ℓ is a c-approximate pseudometric by definition, and for any labeling f : X → Y there exists h ∈ H such that for all distributions D:

OPT ≤ err_{D×f,ℓ}(h) ≤ 1.
We now show that the class (X, H, ℓ) is only c-agnostically learnable. Since OPT ≤ 1, it suffices to show that for every m ∈ N, large enough n ∈ N, and randomized algorithm A on m samples, there exists a labeling f : X → Y and a marginal distribution D_X such that:

E_S[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c(1 − 1/n)³.    (2)

For n ≥ 1/(1 − (1 − (c − c′)/c)^{1/3}), applying Markov's inequality to Equation (2) implies that A has error at least c′ with constant probability.
For simplicity, we now restrict our attention to the marginal D_X which is uniform over the set [k] for some k ∈ N to be fixed. By Yao's minimax principle, it's enough to prove that there exists a distribution µ over functions f : [k] → [n]² such that for any deterministic algorithm A the following holds:

E_{f∼µ,S}[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c(1 − 1/n)³.

We now show that the above holds for µ being uniform over all functions from [k] to [n]² when k > 2m/ln(n/(n − 1)). Similar to Proposition 7.5, we have that

E_{f∼µ,S}[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c(1 − 1/n)² · (1/k) Σ_{x∈[k]} Pr_S[x ∉ S],

since no matter the assignment A gives to x ∉ S, it will be wrong on both coordinates with probability (1 − 1/n)² over the randomness of µ. The result follows by noting that for every x ∈ [k]:

Pr_S[x ∉ S] ≥ (1 − 1/k)^m ≥ 1 − 1/n,

since we have assumed k > 2m/ln(n/(n − 1)) and n > 1. Therefore, we get that

E_{f∼µ,S}[err_{D_X×f,ℓ}(A(S, f(S)))] ≥ c(1 − 1/n)³,

which completes the proof.

Sub-sampling: Malicious Noise
Now that we've seen how to handle practical problems like regression over infinite label spaces, we'll discuss a technique that helps handle data corruption and data-dependent assumptions: sub-sampling. The main idea is as follows. Say that the original unlabeled sample we draw is, in some sense, partially corrupted: perhaps an adversary has changed some fraction of examples (malicious noise), or some portion of the sample is un-realizable for a concept in the class (robust and partial learning). In either case, there generally exists a core subset of "clean" samples that we can use to recover the guarantees of LEARNINGTOCOVER. Since we cannot necessarily identify these, the idea is to run LEARNINGTOCOVER over enough subsets of the unlabeled sample that we find a clean subsample with high probability. In this section we'll discuss the application of this technique in detail to Kearns and Li's [17] well-studied malicious noise model. In the appendix, we discuss applications to recently popular adversarially robust (Appendix B) and partial learning (Appendix C) models.
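The sub-sampling idea is mechanical enough to sketch in a few lines. Here `base_cover_builder` stands in for a run of LEARNINGTOCOVER; all names are our own illustration, and the enumeration is exponential (the reduction is sample-efficient, not computationally efficient):

```python
from itertools import combinations

def subsample_covers(base_cover_builder, corrupted_sample, eta_prime):
    """Sub-sampling for malicious noise: w.h.p. at least a (1 - eta')
    fraction of the unlabeled sample is clean, so running the cover
    builder over *every* subset of that size guarantees that some run
    saw only clean points, recovering the cover's guarantee."""
    keep = max(1, int((1 - eta_prime) * len(corrupted_sample)))
    cover = []
    for subset in combinations(corrupted_sample, keep):
        cover.extend(base_cover_builder(list(subset)))
    return cover
```

For instance, with four unlabeled points of which one is corrupted and η′ = 1/4, one of the four size-3 subsets avoids the corrupted point, so the combined cover contains whatever the clean run produces.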
To start, let's recall the standard malicious noise model. In this variant of PAC learning, instead of having access to the standard sample oracle from the adversary's distribution D over X × Y , we have access to a malicious oracle O M (·) which, with probability η, outputs an adversarially chosen pair (x, y), and otherwise samples from D as usual.

Definition (Agnostic Learning with Malicious Noise). We say (D, X, H, ℓ) is agnostically learnable under malicious noise rate η if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying:

1. The marginal D_X ∈ D,

A outputs a good hypothesis with high probability over samples S of size n(ε, δ) drawn from the malicious oracle:

Pr_{S∼O_M^{n(ε,δ)}}[err_{D,ℓ}(A(S)) ≤ OPT + ε] ≥ 1 − δ,

where OPT = min_{h∈H} {err_{D,ℓ}(h)}.
In other words, malicious noise essentially gives a worst-case formalization of the idea that an η-fraction of the learner's data is (adversarial) garbage.
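To make the model concrete, here is a minimal Python sketch (all names are ours, not from the paper) of how a malicious oracle can be simulated: with probability η it returns a pair chosen by the adversary, and otherwise an honest draw from D.

```python
import random

def malicious_oracle(sample_from_d, adversarial_pair, eta, rng=random):
    """With probability eta, return an adversarially chosen pair;
    otherwise draw an honest labeled example from D."""
    if rng.random() < eta:
        return adversarial_pair()   # adversary picks any (x, y)
    return sample_from_d()          # honest draw (x, y) ~ D

# Toy usage: D is uniform over {(0, 0), (1, 1)}; the adversary always lies.
honest = lambda: (random.randint(0, 1),) * 2
liar = lambda: (0, 1)
sample = [malicious_oracle(honest, liar, eta=0.2) for _ in range(1000)]
```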
Let's now formalize the argument above: modifying LEARNINGTOCOVER to run over subsamples gives a sample-efficient algorithm for learning with malicious noise. For readability, we'll (somewhat informally) restate the algorithm with this change.

Algorithm 3: Malicious to Realizable Reduction
Input: Realizable PAC-Learner A, Accuracy Parameter ε < 1/2, Noise Parameter η < ε/(1+ε), Unlabeled Sample Oracle O_U, Labeled Sample Oracle O_L
Algorithm:
1. Draw an unlabeled sample S_U ∼ O_U, and labeled sample S_L ∼ O_L.
2. Run LEARNINGTOCOVER over every subset of S_U of size (1 − η′)|S_U|, and let C(S_U) denote the union of the resulting covers.
3. Output the hypothesis in C(S_U) with minimal empirical risk over S_L.

We now prove that Algorithm 3 gives an (agnostic) learner that is tolerant to malicious noise.
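As a sanity check on the sub-sampling idea, here is a minimal Python sketch of the reduction. The names (`cover_builder`, `emp_risk`, and so on) are hypothetical, and LEARNINGTOCOVER is abstracted as a routine that returns a finite list of candidate hypotheses from an unlabeled sample.

```python
from itertools import combinations

def malicious_to_realizable(cover_builder, unlabeled, labeled, eta_prime):
    """Sketch of Algorithm 3: pool the covers built from every subsample
    that could be the 'clean' portion of the data, then run ERM over the
    pooled candidates on the (possibly corrupted) labeled sample."""
    keep = int((1 - eta_prime) * len(unlabeled))
    candidates = []
    for subsample in combinations(unlabeled, keep):
        # with high probability, at least one subsample is fully uncorrupted
        candidates.extend(cover_builder(subsample))

    def emp_risk(h):
        return sum(h(x) != y for x, y in labeled) / len(labeled)

    return min(candidates, key=emp_risk)
```

Enumerating all subsets of the stated size is of course exponentially expensive; the point of the reduction is sample efficiency, not computational efficiency.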
Proof. To start, we'll review for completeness a fairly standard analysis of empirical risk minimization under malicious noise. Assume for the moment that the output of LEARNINGTOCOVER, C(S_U), contains a hypothesis h′ satisfying err(h′) ≤ OPT + β_1. Say we draw M labeled samples for the ERM step, and an η′ = η + β_2 fraction are corrupted by the adversary. For large enough M, we can assume by a Chernoff and union bound that the empirical loss of every hypothesis returned by LEARNINGTOCOVER is at most some β_3 away from its true loss on the un-corrupted portion of M (we will make all these assumptions formal in a moment). Given these facts, notice that the empirical loss of h′ on M is at most:

err_M(h′) ≤ (1 − η′)(OPT + β_1 + β_3) + η′.

On the other hand, the empirical error of any h ∈ H whose true error is greater than OPT + ε is at least:

err_M(h) ≥ (1 − η′)(OPT + ε − β_3).

To ensure that our ERM works, it is enough to show that for any such h, err_M(h) > err_M(h′). A simple calculation shows that this is satisfied as long as β_1 + 2β_3 ≤ ε and β_1 + β_2 + 2β_3 ≤ ∆. Setting β_1 = β_2 = β_3 = ∆/4 gives the desired result.
It is left to argue that our assumptions above hold with high probability, and in particular that C(S_U) contains a hypothesis of error at most OPT + ∆/4. To show this, it is enough to ensure that we run LEARNINGTOCOVER over a clean subsample of size at least n(∆/4, δ/4) with high probability. If we draw |S_U| = O(n(∆/4, δ/4)/∆²) unlabeled samples, a similar Chernoff bound to the above promises that at most an η′ fraction are corrupted with high probability, and therefore that at least n(∆/4, δ/4) samples remain un-corrupted. Running LEARNINGTOCOVER over all subsets of size (1 − η′)|S_U| then gives the desired result. The sample complexity bound follows from choosing M large enough to satisfy the above conditions, along with the fact that |C(S_U)| is bounded by the number of subsamples.

It's worth noting that the error tolerance of Theorem 8.8 is tight. In their original introduction of malicious noise, Kearns and Li [17] proved that for most non-trivial concept classes, no PAC-learner can tolerate ε/(1+ε) malicious noise. Theorem 8.8 also extends to other scenarios we've seen so far such as arbitrary loss over finite label classes and approximate pseudometrics. The proof remains mostly the same, though the optimal error tolerance may differ.

Since our agnostic model restricts the adversary to choosing a distribution whose marginal lies in the original family, Theorem 8.8 provides the first insight on robustness against an adversary who can corrupt the underlying data as well as the labels. One might wonder whether this result can be pushed further: is it possible to be robust against an adversary who can corrupt the marginal over X in some stronger sense? Unfortunately, the answer is no: malicious noise is essentially the strongest form of distributional corruption we can handle. Let's look at two basic lower bounds to see why. First, we'll consider an adversary who can remove a portion of the learner's sample.

Proposition 8.9. There exists a learnable class (D, X, H) which is not learnable under an adversary who can remove a small fraction of the learner's sample.

Proof.
This follows from a result of Dudley, Kulkarni, Richardson, and Zeitouni [20] that there exists an unlearnable class (D, X, H) such that, for some n(ε, δ), the class is learnable in n(ε, δ) samples under any single fixed D ∈ D. The lower bound then follows simply from adding an extra unique identifying point x_D to X for every distribution D, and modifying each D ∈ D to have Θ(δ) support on x_D. This modified class is clearly learnable, since after drawing O(1/δ) samples the learner will draw x_D and identify the distribution D with good probability. However, the class is not learnable under an adversary who removes points, since with high probability the adversary can completely remove any mention of x_D from the learner's sample, reducing to the original unlearnable class (D, X, H).
An adversary who can add samples is similarly powerful. In the realizable setting, if the adversary is allowed to add an arbitrary number of correctly labeled points to the learner's sample, basic classifiers such as halfspaces become unlearnable [48]. On the other hand, if the adversary is limited to adding only a few additional samples, realizable learning may remain possible, but even trivial concept classes cannot be agnostically learnable.
Proposition 8.10. There exists a class (X, H) which for any γ > 0 is realizably but not agnostically learnable under an adversary who can add a γ fraction of correctly labeled points to the learner's sample.
Proof. Consider an instance space X = {x, x_1, x_2} and a class H = {h_1, h_2} of two hypotheses that agree on x but disagree on both x_1 and x_2. In the realizable setting, note that a single labeled example on x_1 or x_2 exactly determines the hypothesis. As long as there is less than 1 − ε mass on x, the learner will draw such a sample after O(1/ε) samples with good probability. Further, if the mass on x is at least 1 − ε, then h_1 and h_2 are both valid outputs. As a result, any ERM is a valid PAC-learner. Since adding correctly labeled examples can only help this learner, the class remains realizably learnable under an adversary who can add an arbitrary number of clean samples.
In the agnostic setting, consider an adversary who chooses a labeling f that agrees with h_1 everywhere except x_2, and hence with h_2 everywhere except x_1. The optimal hypothesis h_OPT is then decided by the amount of mass on x_1 and x_2 in the marginal distribution: namely, if the adversary chooses a distribution D over {x, x_1, x_2}, the optimal error is min{D(x_1), D(x_2)}. The idea is then to note that for γ′ ≤ γ/4, the learner cannot distinguish between the two following distributions:

D_1(x_1) = c_1γ′,       D_1(x_2) = (1 − c_1)γ′,  D_1(x) = 1 − γ′,
D_2(x_1) = (1 − c_1)γ′,  D_2(x_2) = c_1γ′,       D_2(x) = 1 − γ′,

where 1/2 > c_1 > 0 is some small constant. Informally, if the two distributions are indistinguishable, any learner will always incur error of around (1 − c_1)γ′/2, whereas OPT is c_1γ′ for both distributions.

Let's now give the formal argument. By Yao's Minimax Principle it is enough to prove there is a strategy over distributions such that any deterministic learner has high error. In particular, if we can prove that the expected error is at least 3 · OPT, then Pr[error ≥ 2 · OPT] ≥ OPT. Since OPT is just some constant c_1γ′ (dependent only on γ), this is sufficient to prove the result.

Moving on, consider the strategy in which the adversary chooses the labeling described above, and chooses each marginal (D_1 or D_2) with probability 1/2. We'll break our analysis into two cases depending on the sample complexity of the learner. If the learner uses O(1/γ′) examples, then there is a constant probability of drawing a sample consisting only of the point x. Let f′ be the hypothesis returned by the deterministic learner on this sample. By construction, f′ must disagree with either h_1 or h_2 on x_1 or x_2. Assume f′ disagrees with h_1 on x_1 (the other cases follow similarly). When the distribution is D_2, f′ has error at least (1 − c_1)γ′. Since this occurs with constant probability independent of the choice of c_1, choosing c_1 sufficiently small leads to an expected error of at least 3c_1γ′ as desired.
On the other hand, when there are n = Ω(1/γ′) samples, we claim that the adversary can force the following sample to occur with constant probability: 2γ′n instances each of x_1 and x_2, and (1 + γ)n − 4γ′n instances of x. This follows from the fact that for the appropriate choice of constant for n, a Chernoff bound gives that both x_1 and x_2 occur at most 2γ′n times with constant probability. Since the adversary is allowed to add γn ≥ 4γ′n arbitrary examples, they can add instances of x_1, x_2, and x until the above sample is achieved. The remainder of the argument is then the same as in the previous case, as any learner response on this sample will incur similarly high expected error.
It is also reasonable to consider distributional corruption in the semi-supervised setting, where the unlabeled and labeled data-sets might have different underlying distributions. We discuss this model in Section 8.4.

Replacing ERM: Semi-Private Learning
So far we have focused on property generalization for two forms of noise tolerance: agnostic learning and learning with malicious noise. In this section, we'll show how to use Algorithm 1 to generalize a broader spectrum of finitely-satisfiable properties by replacing the ERM step with a generic finite learner satisfying the desired property. Our prototypical example will be privacy, which is well known to be finitely-satisfiable via McSherry and Talwar's [26] exponential mechanism. To start, we'll cover a few basic privacy definitions.
Definition 8.11 (Differential Privacy). A learning algorithm A is said to be α-differentially private if for all neighboring inputs S, S′ which differ on a single example:

Pr[A(S) ∈ T] ≤ e^α · Pr[A(S′) ∈ T]

for all measurable events T in the range of A.
The exponential mechanism is one of the most widely used techniques in privacy. Informally, the algorithm allows for differentially private selection of a "good" choice from a finite set of objects (potential hypotheses in our case). More formally, let s : (X × Y)* × H → R be a "score" function, and define the "sensitivity" ∆_s to be

∆_s := max_{h∈H} max_{S∼S′} |s(S, h) − s(S′, h)|,

where S, S′ range over pairs of neighboring datasets. The exponential mechanism selects an item with a good score with high probability, while maintaining privacy.
Definition 8.12 (Exponential Mechanism [26]). The exponential mechanism M_E on inputs S, H, s with privacy parameter α selects and outputs h ∈ H with probability proportional to exp(−α · s(S, h)/(2∆_s)).
It is well known that the exponential mechanism leads to a private learner for finite hypothesis classes under bounded loss.
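For intuition, here is a minimal Python sketch of the exponential mechanism over a finite hypothesis set, under the convention (our assumption, matching the learning setting) that the score is an empirical error to be minimized.

```python
import math, random

def exponential_mechanism(hypotheses, score, sensitivity, alpha, rng=random):
    """Exponential mechanism (sketch): sample a hypothesis with probability
    proportional to exp(-alpha * score / (2 * sensitivity)), so low-score
    (low empirical error) hypotheses are exponentially favored."""
    weights = [math.exp(-alpha * score(h) / (2 * sensitivity))
               for h in hypotheses]
    r = rng.random() * sum(weights)
    for h, w in zip(hypotheses, weights):
        r -= w
        if r <= 0:
            return h
    return hypotheses[-1]  # guard against floating-point underflow
```

For large α the mechanism concentrates on near-optimal hypotheses; for small α it approaches a uniform (and hence maximally private) choice.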
We note that [49, Theorem 3.4] only considers classification loss, but the extension to bounded loss is immediate. Unfortunately, even with the power of the exponential mechanism, privacy is a very restrictive condition in the general PAC framework, since we're most often interested in infinite hypothesis classes. Indeed, even improper private learning requires finiteness of a highly restrictive measure known as representation dimension [41], which can be infinite for classes of VC dimension 1. As a result, the past decade has seen the introduction of a number of weaker, more practical definitions of privacy. In this section we'll focus on a model introduced in 2013 by Beimel, Nissim, and Stemmer [19] called semi-private learning.

Definition 8.14 (Semi-Private Learning). We call a class (D, X, H, ℓ) semi-private PAC-learnable if there exists an algorithm A and two functions n_pub = n_pub(ε, δ, α) and n_pri = n_pri(ε, δ, α) such that for all ε, δ > 0 and distributions D over X × Y whose marginal D_X is in D, A satisfies the following:

1. A outputs a good hypothesis with high probability:

Pr_{S_U∼D_X^{n_pub}, S_L∼D^{n_pri}}[err_{D,ℓ}(A(S_U, S_L)) ≤ OPT + ε] ≥ 1 − δ.

2. A is semi-private. That is, for all S_U ∈ X^{n_pub}, the algorithm A(S_U, ·) is α-differentially private with respect to the labeled sample.

In other words, semi-private learning offers a model for applications where labeled data is sensitive, but some (perhaps opt-in) users might not care about their participation itself being released. Unlike standard private learning, distribution-free semi-private classification is known to be characterized by VC dimension, just like realizable PAC-learning [19]. The best sample complexity bounds are due to Alon, Bassily, and Moran (ABM) [27], who use uniform convergence to build a uniform cover for H from unlabeled data, and then apply the exponential mechanism to the resulting cover.
Due to their reliance on uniform convergence, ABM's techniques fail in the more general settings we consider. Further, their use of uniform covers results in sub-optimal public sample complexity even for distribution-free classification. We prove in Appendix G that these objects require asymptotically more samples than non-uniform covers (at least in the distribution-family model), and therefore cannot be used to achieve optimal semi-private learning. We circumvent both of these issues by appealing directly to a realizable learner to build a weaker non-uniform cover. For readability, we first restate the algorithm here.

Algorithm 4: Semi-Private to Realizable Reduction
1. Draw a public unlabeled sample S_U and a private labeled sample S_L.
2. Run LEARNINGTOCOVER over S_U to get C(S_U).
3. Return the hypothesis in C(S_U) given by applying the exponential mechanism with respect to S_L.
We prove that Algorithm 4 gives a semi-private agnostic learner in the distribution-family setting.

Proof. The proof is essentially the same as that of Theorem 8.5. The only difference in the argument is to replace the generic ERM learner over the output of LEARNINGTOCOVER with the exponential mechanism [49].
Let's now take a look at what Theorem 8.15 implies about the special case of distribution-free classification.
This improves over the recent upper bound of ABM [27]. In fact, for constant d and δ, Corollary 8.16 completely resolves the unlabeled sample complexity of semi-private learning, as ABM prove a matching lower bound. On the other hand, we note that the private sample complexity remains off by a log factor from the best known lower bounds of Chaudhuri and Hsu [28].
Theorem 8.18 (Private Lower Bound [28]). There exist classes of VC dimension O(1) which require at least: private samples to learn.
While we have now resolved the public sample complexity of improper learning, it remains an interesting open problem in the proper regime, where certain classes are known to require an extra log(1/ε) factor in the standard PAC sample complexity [50,51]. We conjecture that Theorem 8.15 should still be tight in this setting: namely, that the unlabeled semi-private sample complexity should always be at least the realizable PAC sample complexity.

Changing the Base Model: Covariate Shift
One issue with semi-supervised models like semi-private learning is that, in practice, the distribution over unlabeled data probably won't match the labeled data exactly. In this section, we'll talk about a final modification to our reduction to tackle such scenarios and more generally to extend property generalization beyond the realizable PAC setting: replacing the base learner. In fact, we already saw this strategy used to a lesser extent in Section 8.1, where we replaced our standard realizable base learner with a discrete learner. Here we'll look at an application in which we assume our initial learner is robust to covariate shift [52], meaning that even if the distribution underlying the data shifts between train and test time, the algorithm will continue to perform well. This stronger assumption will allow us to build semi-private learners that can handle corruption between the public and private databases. To start, let's formalize covariate shift in the distribution-family model.
Definition 8.20 (Covariate Shift). Let (D, X, H, ℓ) be any class, and for every ε > 0 let C_ε be a "covariate-shift" function that maps every D ∈ D to some family of distributions over X. Given any distribution D ∈ D and any h ∈ H, let the error of a potential labeling h′ be given by its worst-case error over C_ε(D), that is:

err_{C_ε(D),ℓ}(h′) := sup_{D′∈C_ε(D)} E_{x∼D′}[ℓ(h′(x), h(x))].

We say that (D, X, H, ℓ) is realizably learnable under covariate shift C = {C_ε} if there exists an algorithm A and function n = n(ε, δ) such that for all ε, δ > 0, D ∈ D, and h ∈ H:

Pr_{S∼(D,h)^{n(ε,δ)}}[err_{C_ε(D),ℓ}(A(S)) ≤ ε] ≥ 1 − δ.

We call such a learner robust to covariate shift.
We emphasize that in the above definition, the covariate shift family scales with the error parameter ε. This is a bit different than Shimodaira's original definition [52], but is a natural choice in our context since we consider algorithms which only use access to the original source distribution (sometimes called "conservative domain adaptation" [53]). In this setting, we'd expect that as we demand higher accuracy, the amount of covariate shift we can tolerate will decrease. Indeed in the agnostic model, it's clear this scaling is necessary by a similar argument to Proposition 8.10.
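As a toy illustration of the worst-case error in Definition 8.20, when the shift family C_ε(D) is finite it can be evaluated directly. The sketch below (purely illustrative; marginals are represented as dictionaries over a finite domain) takes the maximum expected loss over all shifted marginals.

```python
def shifted_error(h, target, shift_family, loss):
    """Worst-case error of hypothesis h against the labeling `target`,
    over every shifted marginal in the covariate-shift family C_eps(D).
    Each marginal is a dict mapping points to probabilities."""
    return max(
        sum(p * loss(h(x), target(x)) for x, p in marginal.items())
        for marginal in shift_family
    )
```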
The key observation for applying learning under covariate shift in our reduction is simply to notice that the non-uniform cover output by LEARNINGTOCOVER must contain a hypothesis close to optimal under any shifted distribution in C_ε(D). This can then be used to analyze any semi-supervised model where the marginal of the labeled distribution may be corrupted from D to any distribution in C_ε(D). In this section, we'll again focus on the setting of semi-private learning. First, let's formalize what it means to be semi-private learnable under covariate shift.

Definition 8.21 (Semi-Private Learning under Covariate Shift). We call a class (D, X, H, ℓ) semi-private PAC-learnable under covariate shift C = {C_ε} if there exists an algorithm A and two functions n_pub = n_pub(ε, δ, α) and n_pri = n_pri(ε, δ, α) such that for all ε, δ > 0, marginals D_X ∈ D, and distributions D′ over X × Y whose marginal D′_X ∈ C_ε(D_X), A satisfies the following:

1. A outputs a good hypothesis over D′ with high probability:

Pr_{S_U∼D_X^{n_pub}, S_L∼D′^{n_pri}}[err_{D′,ℓ}(A(S_U, S_L)) ≤ OPT′ + ε] ≥ 1 − δ,

where OPT′ = min_{h∈H}[err_{D′,ℓ}(h)] is the minimum error over the shifted distribution.

2. A is semi-private. That is, for all S_U ∈ X^{n_pub}, the algorithm A(S_U, ·) is α-differentially private with respect to the labeled sample.

In other words, we'd like to recover a near-optimal hypothesis even when the marginal distribution over private data is shifted from the public data. This is a realistic scenario in practice, since the distribution of "opt-in" users is likely different from the marginal over the total population. We'll show that this issue is solvable in the semi-private setting as long as the analogous issue in the non-private setting (distribution shift between train and test time) can be resolved, up to a constant factor c_ℓ := sup_{a,b∈Y} ℓ(a, b) depending only on ℓ.
Proof. The proof is essentially the same as that of Theorem 7.4. The only difference is to note that for all marginal distributions D_X ∈ D and choices of shift D′_X ∈ C_ε(D_X), the output of LEARNINGTOCOVER contains some hypothesis h′ satisfying

err_{D′,ℓ}(h′) ≤ err_{D′,ℓ}(h_OPT′) + ε/2,

where h_OPT′ is some optimal hypothesis over the shifted distribution D′. As in the standard analysis, this is guaranteed by including the output of the realizable learner on the labeling given by h_OPT′ (robustness to covariate shift promises this output is close under D′_X with high probability). The remainder of the argument is exactly as in Theorem 8.5, with the exception of working over the shifted distribution D′_X instead of D_X.
We note that this result can also be extended to the more general loss functions discussed in Section 8.1 without much difficulty, given the appropriate definition of discrete learnability under covariate shift.
Since Theorem 8.22 is a bit abstract, let's take a look at one concrete application. Given a class (X, H), the class-dependent total variation distance is a metric on distributions measuring the worst-case distance across elements of H∆H := {h∆h′ : h, h′ ∈ H}:

TV_{H∆H}(D_1, D_2) := sup_{S∈H∆H} |D_1(S) − D_2(S)|.

It is not hard to see that any realizable learner is robust to O(ε) covariate shift in TV_{H∆H} distance. We can then apply Theorem 8.22 to build a robust semi-private learner.
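On a finite domain with a finite class, TV_{H∆H} can be computed by brute force over symmetric differences. The following sketch (names are ours) makes the definition concrete; distributions are dictionaries from points to probabilities.

```python
from itertools import product

def tv_class_distance(d1, d2, hypotheses, domain):
    """Class-dependent total variation distance: the largest discrepancy
    |D1(S) - D2(S)| over symmetric-difference sets
    S = {x : h(x) != h'(x)} for h, h' in the class."""
    best = 0.0
    for h, h2 in product(hypotheses, repeat=2):
        s = {x for x in domain if h(x) != h2(x)}
        mass1 = sum(d1.get(x, 0.0) for x in s)
        mass2 = sum(d2.get(x, 0.0) for x in s)
        best = max(best, abs(mass1 - mass2))
    return best
```

Note that TV_{H∆H} is never larger than the ordinary total variation distance, which is recovered when H∆H contains all subsets of the domain.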
Finally, we note again that our original learner in these results is robust to covariate shift despite having no access to samples from the new distribution. Unfortunately, this model does come with fairly strong lower bounds regarding the type of covariate shifts to which it is possible to be robust [53]. One solution to this problem is to consider a relaxed variant called (non-conservative) domain adaptation, where the learner additionally has access to a small number of unlabeled samples from the test-time distribution. It is certainly possible to define an analog in the semi-private setting, but naively the use of unlabeled data from the private distribution breaks our reduction since privacy won't be preserved. We leave as an open question whether any sort of PAC-learner in the non-conservative model could imply semi-private learners with stronger robustness to covariate shift. Some progress has been made in this direction recently by Bassily, Moran, and Nandi [54] for distribution-free classification of halfspaces.

A Doubly Bounded Loss
In this section we discuss a natural generalization of loss functions over finite label classes we call doubly bounded loss: for all distinct y, y ′ ∈ Y , we require ℓ(y, y ′ ) ∈ [a, b] for some b ≥ a > 0. This is trivially satisfied by any loss function on a finite label class that satisfies the identity of indiscernibles.
As discussed in Section 8.1, since we now allow Y to be infinite, we need to work with discrete learnability instead of realizable learnability. We can use a slight modification to the discretization technique in Theorem 8.5 to prove the equivalence of discrete and agnostic learnability for doubly-bounded loss functions. Note that this is stronger than our guarantee for c-approximate pseudometrics, which only gives c-agnostic learnability.
Theorem A.2. Let ℓ : Y × Y → R_{≥0} be an (a, b)-bounded loss function. Then for any class (D, X, H, ℓ) the following are equivalent:

1. (D, X, H, ℓ) is discretely learnable.
2. (D, X, H, ℓ) is agnostically learnable.

Proof. The proof that agnostic learnability implies discrete learnability is the same as in Theorem 8.5, so we focus only on the forward direction. Assume (D, X, H, ℓ) is discretely learnable. Fix ε′ = aε/(4b), and let H_{ε′} be a learnable ε′-discretization of H. We argue that running LEARNINGTOCOVER on H_{ε′} (using the promised discrete learner) gives the desired agnostic learner. As before, it is sufficient to prove that C(S_U) contains a hypothesis h′ such that err_{D,ℓ}(h′) ≤ OPT + ε/2. Since ℓ is upper bounded, standard empirical risk minimization arguments then give the desired result.

To see why C(S_U) has this property, recall from Lemma 7.3 that for any h ∈ H_{ε′}, with probability at least 1 − δ/2 there exists h′ ∈ C(S_U) that is ε′-close to h in the following sense:

E_{x∼D_X}[ℓ(h′(x), h(x))] ≤ ε′.

Because ℓ is a-lower bounded, this implies that h′ must be close to h in classification error:

Pr_{x∼D_X}[h′(x) ≠ h(x)] ≤ ε′/a,

and since the loss is bounded by b, the risk of h′ cannot be much more than that of h:

err_{D,ℓ}(h′) ≤ err_{D,ℓ}(h) + b · ε′/a = err_{D,ℓ}(h) + ε/4.

Let h_OPT ∈ H be an optimal hypothesis. Since H_{ε′} ε′-covers H, there exists h^{ε′}_OPT ∈ H_{ε′} such that:

E_{x∼D_X}[ℓ(h^{ε′}_OPT(x), h_OPT(x))] ≤ ε′, and hence err_{D,ℓ}(h^{ε′}_OPT) ≤ OPT + ε/4.

Let h′^{ε′}_OPT ∈ H_{ε′} denote the output of the base learner A on the labeling given by h^{ε′}_OPT. Then by the above, we have that:

err_{D,ℓ}(h′^{ε′}_OPT) ≤ err_{D,ℓ}(h^{ε′}_OPT) + ε/4 ≤ OPT + ε/2.

It is worth noting that the upper bound on the loss can be removed if the adversary is restricted to choosing a marginal over Y which is weakly concentrated.

B Robust Learning
Robust learning is an extension of the PAC setting that models an adversary with the power to perturb examples at test time. In practice, this corresponds to the fact that we'd like our predictors to be stable to small amounts of adversarial noise: this could range anywhere from a sticker on a stop sign tricking a self-driving car, to completely imperceptible perturbations that totally fool standard classifiers. The latter was famously demonstrated by Athalye, Engstrom, Ilyas, and Kwok [55], who showed how to generate such perturbations and provided the classic example of tricking a standard ImageNet classifier into thinking a turtle was a rifle. Their seminal work caused an explosion of both practical and theoretical research in the area. Formally, adversarial robustness is modeled simply by changing the error function to be the maximum error over some pre-defined set of neighboring perturbations.
(Their work has over 900 citations despite being only four years old.)

Definition B.1 (Robust Loss). Let X be an instance space and U : X → P(X) a "perturbation function" mapping elements to a set of possible perturbations. Given a loss function ℓ : Y × Y → R_{≥0}, the robust loss of a concept h : X → Y with respect to a distribution D over X × Y is:

err^U_{D,ℓ}(h) := E_{(x,y)∼D}[ sup_{x′∈U(x)} ℓ(h(x′), y) ].
In other words, a hypothesis with low robust loss performs well even against an adversary who can perturb x to any "nearby point" (i.e. any x ′ ∈ U (x)). Standard realizable and agnostic Robust PAC-learning are then simply defined by replacing the standard error function with the robust error function. Robust learning in the distribution-family model does require one extra twist: we need to make sure that each hypothesis in the class actually has a corresponding distribution over which it is realizable. To this end, we introduce a basic notion of closure for distribution families.
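The empirical version of the robust loss, for finite perturbation sets, is a direct transcription of Definition B.1; the sketch below (names are ours) charges each example the worst-case loss over its allowed perturbations.

```python
def robust_loss(h, sample, perturb, loss):
    """Empirical robust loss: each labeled example (x, y) is charged the
    worst-case loss over all allowed perturbations x' in U(x)."""
    return sum(
        max(loss(h(xp), y) for xp in perturb(x))
        for x, y in sample
    ) / len(sample)
```

Note this is exactly the weak access our reduction needs: the ability to estimate empirical robust loss over a finite sample.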
Definition B.2 (Robust Closure). Let D be a set of distributions over an instance space X and H a concept class. Given any concept h, let X_h denote the set of points in X on which h has 0 robust loss with respect to itself, that is:

X_h := {x ∈ X : h(x′) = h(x) for all x′ ∈ U(x)}.

For notational simplicity, let D|_h denote the restriction D|_{X_h}. The robust closure of D under H is:

D^U_H := {D|_h : D ∈ D, h ∈ H}.

In the robust distribution-family model, it only really makes sense to define realizable learnability over the robust closure of D, since otherwise there may be hypotheses in the class that are not realizable with respect to any distribution in D and cannot be chosen by the adversary at all. With this in mind, let's formalize this model. Agnostic learnability is defined similarly, but since the adversary is unrestricted, there is no longer any need to take the robust closure.
Definition B.4 ((Agnostic) Distribution-Family Robust PAC Learning). A class (D, X, H, ℓ) is robustly PAC-learnable in the agnostic setting with respect to perturbation function U if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying D_X ∈ D:

Pr_{S∼D^{n(ε,δ)}}[err^U_{D,ℓ}(A(S)) ≤ OPT + ε] ≥ 1 − δ, where OPT = min_{h∈H}{err^U_{D,ℓ}(h)}.

We note that different works consider different models of access to the perturbation set U as well (e.g. assuming U is known to the learner [8], or has some type of oracle access [56,24]). Our reduction requires fairly weak access to U: it is enough to be able to estimate the empirical robust loss of a hypothesis h over any finite sample S ⊂ X. With this in mind, let's now prove realizable and agnostic robust learning are equivalent in the distribution-family model. We'll focus on the special case of (multi-class) classification, and start by re-stating our modified algorithm for simplicity of presentation.

Let µ* denote the mass of D_X on points where h_OPT has nonzero robust loss with respect to itself (i.e. on the complement of X_{h_OPT}), let D̄|_{h_OPT} denote the restriction of D to this region, and let OPT′ denote the robust error of h_OPT over D|_{h_OPT}. We can then decompose OPT as:

OPT = (1 − µ*) · OPT′ + µ*,

where the last step follows from noting that by definition, for all x in the support of D̄|_{h_OPT}, h_OPT is not constant on U(x). To get a function within ε/2 robust loss of OPT, we claim it is sufficient to prove C(S) contains some h within robust error ε/(2(1 − µ*)) of h_OPT over D|_{h_OPT}, that is some h satisfying:

E_{x∼D|_{h_OPT}}[ sup_{x′∈U(x)} 1[h(x′) ≠ h_OPT(x)] ] ≤ ε/(2(1 − µ*)).

Theorem B.5 can be extended to many of the generic property generalization results in the main body, including approximate pseudometric loss, malicious noise, and semi-private learning, though the exact parameters may be somewhat weaker (e.g. learning over non-binary loss may incur additional factors and lead to c-agnostic rather than truly agnostic learning).

C Partial PAC-Learning
Partial PAC-learning is an extension of the standard PAC model to functions that are only defined on a certain portion of the input. Originally introduced by Long [12] and recently developed in greater depth by Alon, Hanneke, Holzman, and Moran (AHHM) [13], this model allows for the theoretical formalization of popular data-dependent assumptions such as margin that have no known analog in the PAC model. Combined with the distribution-family framework, this captures a significant portion of learning assumptions studied in both theory and practice (e.g. learning halfspaces with margin and distributional tail bounds).

Let's formalize this model, starting with partial functions. A partial function is a function f : X → Y ∪ {∗}, where f(x) = ∗ denotes that f is undefined on x; we write supp(f) := {x ∈ X : f(x) ≠ ∗} for its support. Standard Partial PAC-learning is defined much like the standard model with the simple modification that "∗" labels are always considered to be incorrect. As a result, in the realizable case, when the adversary selects a particular partial function f, their marginal distribution over the instance space X must be restricted to lying on supp(f). This makes formalizing data-dependent assumptions easy: if one wanted to consider halfspaces with margin γ, for instance, one simply labels every point within γ of the decision boundary as "∗."

Interestingly, much like the distribution-family setting, Partial PAC-learning falls outside both the uniform convergence and the sample compression paradigms [57]. AHHM also show a dramatic failure of empirical risk minimization: not only does naively applying an ERM to the partial class fail, it will also fail on any total extension of the class. Despite the lack of these standard tools, both Long and AHHM were able to show that distribution-free classification of partial classes is still controlled by VC dimension, and as a result that the equivalence of realizable and agnostic learnability extends to this setting.
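The "∗ is always wrong" convention, and the margin example above, can be sketched in a few lines of Python (all names are ours; None plays the role of "∗"):

```python
STAR = None  # the "undefined" label, standing in for "*"

def partial_error(h, sample):
    """Empirical error of a partial hypothesis: an undefined prediction
    (the "*" label) is always counted as a mistake."""
    mistakes = sum(1 for x, y in sample if h(x) is STAR or h(x) != y)
    return mistakes / len(sample)

def margin_concept(threshold, gamma):
    """Toy margin-style partial concept on the real line: points within
    gamma of the threshold are labeled "*" (undefined)."""
    def h(x):
        if abs(x - threshold) < gamma:
            return STAR
        return int(x > threshold)
    return h
```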
In this section, we'll discuss how a variant of our reduction shows that this result extends to the distribution-family model, extended loss function, and to properties beyond agnostic learning.
In the distribution-family model, formalizing realizable learnability requires some slight changes from the standard model, since we need to make sure our hypotheses are actually realizable over some distribution in the family (this is automatic in the distribution-free setting). To this end, we introduce a basic notion of closure for distribution families.
Definition C.2 (Partial Closure). Let D be a set of distributions over an instance space X and H a concept class. Given any concept h and distribution D over X, let D|_h denote the restriction D|_{supp(h)}. The partial closure of D under H is:

D_H := {D|_h : D ∈ D, h ∈ H}.

In the realizable model it makes more sense to work with the closure of D than with D itself, since otherwise the class H may contain hypotheses that cannot be realized over any distribution, and therefore cannot be accessed by the adversary at all. For simplicity, we'll also restrict our attention to (multi-class) classification where the label space Y = [m], and recall that the loss on any undefined point is always 1. Agnostic learnability is defined analogously, but since the adversary is unrestricted, there is no need to move to the closure of D.
Definition C.4 ((Agnostic) Distribution-Family Partial PAC Learning). A partial class (D, X, H) is PAC-learnable in the agnostic setting if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying D_X ∈ D:

Pr_{S∼D^{n(ε,δ)}}[err_D(A(S)) ≤ OPT + ε] ≥ 1 − δ.

Proof. The proof is essentially the same as for Theorem B.5, but we repeat it here for completeness. As always, it is enough to prove that C(S) (from Algorithm 6) contains a hypothesis h′ with error at most OPT + ε/2. The key issue with our standard reduction is that the optimal hypothesis h_OPT may be undefined on certain examples in the unlabeled sample S_U. By running over all subsamples of S_U, we in essence simulate pulling samples only from the support of h_OPT, which is enough to get the desired guarantee.
More formally, let D|_{h_OPT} be the restriction of D to supp(h_OPT), and D̄|_{h_OPT} the restriction to its complement X \ supp(h_OPT). The idea is to decompose our analysis into two separate parts over D|_{h_OPT} and D̄|_{h_OPT}. With this in mind, let µ* denote the mass of D_X on the undefined portion of h_OPT, and let OPT′ denote the error of h_OPT over D|_{h_OPT}. Since we have restricted our attention to classification, notice that we can decompose OPT as:

OPT = (1 − µ*) · OPT′ + µ*.

We'd like to prove that C(S) contains a hypothesis h′ within ε/2 error of optimal. We claim it is sufficient to show that C(S) contains a hypothesis h′ within ε/(2(1 − µ*)) classification distance of h_OPT over D|_{h_OPT}, since:

err_D(h′) ≤ (1 − µ*) · err_{D|_{h_OPT}}(h′) + µ*
         ≤ (1 − µ*) · (OPT′ + ε/(2(1 − µ*))) + µ*
         = OPT + ε/2,

where the second inequality follows from the fact that h′ and h_OPT only differ on an ε/(2(1 − µ*)) fraction of inputs over D|_{h_OPT}.
It is left to argue that C(S) contains such a hypothesis h′. Recall that on a labeled sample (S, h_OPT(S)) ∼ D|_{h_OPT} × h_OPT of size n(ε/(2(1 − µ*)), δ/3), LEARNINGTOCOVER will contain an h that is ε/(2(1 − µ*))-close to h_OPT in classification error over D|_{h_OPT} with probability at least 1 − δ/3. The idea is then to draw a large enough unlabeled sample such that with probability at least 1 − δ/3, the restriction of the sample to supp(h_OPT) is at least this size (since we run over every subsample, we will always hit this restriction). By a Chernoff bound, it is enough to draw c_1 · n(ε/(2(1 − µ*)), δ/3)/(1 − µ*) points to achieve this for some large enough constant c_1 > 0. Since we do not know µ*, we'll need to draw c_1 · max_{µ∈[0,1−ε]} n(ε/(2(1 − µ)), δ/3)/(1 − µ) points to ensure this property holds (if µ* ≥ 1 − ε, note that any hypothesis gives a valid solution). By a union bound we have that this overall process succeeds with probability at least 1 − 2δ/3, and outputting the hypothesis in C(S) with the lowest empirical risk then succeeds with probability 1 − δ as desired.
Like Theorem B.5, Theorem C.5 can be extended to many of the generic property generalization results in the main body, including approximate pseudometric loss, malicious noise, and semi-private learning, though it may experience some degradation of parameters (e.g. c-agnostic rather than truly agnostic learning) depending on how the loss on "*" values is formalized in these settings.

D Uniform Stability
Uniform stability, originally introduced by Bousquet and Elisseeff [58], is a useful algorithmic property that is closely tied to both generalization and privacy. Informally, an algorithm A is said to be uniformly stable if for all elements x ∈ X, the probability that A changes its output on x over neighboring datasets is small.
Definition D.1 (Uniform Stability). A learning algorithm A is said to be α-uniformly stable if for all neighboring inputs S, S′ which differ on a single example, all x ∈ X, and all y ∈ Y:

|Pr[A(S)(x) = y] − Pr[A(S′)(x) = y]| ≤ α.

Uniform stability can also be thought of as a form of private prediction [59], which protects against adversaries who have restricted access to a model only through prediction responses on individual points (this is often the case in practice since it is common to release APIs with query access rather than full models). Like semi-privacy, this definition has the benefit of maintaining practicality in a reasonable range of circumstances while weakening the stringent requirements of standard private learning. Indeed, it is well known that in the distribution-free classification setting, uniformly stable learning and private prediction are both possible for any class with finite VC dimension [60, 59, 61]. Unsurprisingly, these previous works (at least those working in the agnostic model) rely on uniform convergence and uniform covers. We'll show these can be replaced with a variant of our standard reduction. The argument is otherwise similar to the proof in [61].

Proof. The proof boils down to a standard subsampling trick first noted by [60]. Instead of drawing our standard unlabeled sample of size n(ε/2, δ/2), we draw a sample of size 2n(ε/2, δ/2)/α and run LEARNINGTOCOVER over a random α/2 fraction of the sample. This ensures that swapping out any individual sample can only affect the result with probability α/2. Since this subsample is of size n(ε/2, δ/2), LEARNINGTOCOVER keeps its standard guarantees and the output C(S_U) contains a hypothesis within ε/2 of optimal with probability 1 − δ/2. We can now apply the exponential mechanism with privacy parameter α/4, which ensures the algorithm is α/2-uniformly stable with respect to the labeled sample as well. The sample complexity bounds come from standard analysis of the exponential mechanism and the size of C(S_U).
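The two steps of the proof, subsampling the unlabeled data and selecting from the cover via the exponential mechanism, can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `cover_fn` is a placeholder standing in for LEARNINGTOCOVER, and the exponential-mechanism scaling is shown in schematic form.

```python
import math
import random

def stable_learn(labeled, unlabeled, cover_fn, loss_fn, alpha, rng):
    # Subsampling trick: run the cover construction on a random alpha/2
    # fraction of the oversized unlabeled sample, so swapping out a single
    # unlabeled point changes the subsample (hence the output) with
    # probability at most alpha/2.
    m = max(1, int(len(unlabeled) * alpha / 2))
    candidates = cover_fn(rng.sample(unlabeled, m))
    # Exponential mechanism over empirical risk (privacy parameter ~ alpha/4):
    # sample a candidate with probability proportional to
    # exp(-(alpha/4) * n * risk).
    n = len(labeled)
    risks = [sum(loss_fn(h, x, y) for x, y in labeled) / n for h in candidates]
    weights = [math.exp(-(alpha / 4.0) * n * r) for r in risks]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With a reasonable labeled sample size, the exponential weights concentrate on low-risk candidates while keeping the selection randomized, which is what yields stability with respect to the labeled sample.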
Semi-privacy comes for free due to our use of the exponential mechanism.
As in previous sections, Theorem D.2 can be extended to any of the generic property generalization results in the main body, including for instance c-approximate pseudometric loss, malicious noise, and robustness to covariate shift.

E Statistical Query Model
Kearns' [62] statistical query model is a popular modification of PAC learning where the sample oracle is replaced with the ability to ask noisy statistical questions about the data.
Definition E.1 (Realizable SQ-learning). Given a distribution D over X and h ∈ H, let STAT(D, h) be an oracle which, upon input of a function ψ : X × Y → [−1, 1] and tolerance τ ∈ R_{≥0}, may output any estimate of the expectation of ψ up to τ error, that is, any value v satisfying:

|v − E_{x∼D}[ψ(x, h(x))]| ≤ τ.

We call a class (D, X, H, ℓ) SQ-learnable if for all ε > 0, there exists some tolerance τ = τ(ε), query complexity n(ε, τ), and an algorithm A such that for all D ∈ D and h ∈ H, A achieves ε error in at most n(ε, τ) oracle calls to STAT(D, h) with tolerance at worst τ. Agnostic learning is then defined analogously, where (D, h) is replaced with a generic distribution over X × Y whose marginal lies in D. We can use a basic form of discretization to prove property generalization in the SQ model.
Proof. The idea is similar to our discretization in Theorem 8.5. The realizable SQ-learner A makes some finite number n(ε, τ) of queries. Let C_A denote the set of outputs of A when fed every possible combination of responses from the discretized set {−1, −1 + 2τ, ..., 1 − 2τ, 1}. For every D ∈ D and h ∈ H, one of these combinations must be a valid query response sequence in the realizable model, so C_A covers (D, X, H, ℓ). By the same arguments as Theorem 8.5, C_A must contain a hypothesis with error at most c · OPT + ε. Since we can directly compute the loss of every element of C_A up to τ error in the SQ model simply by querying the loss function, this gives the desired result in |C_A| = (1/τ)^{n(ε,τ)} queries.
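The enumeration of discretized response transcripts can be made concrete with a short sketch. Here `sq_learner` is a hypothetical stand-in for the realizable SQ-learner, modeled as a function consuming a full list of oracle responses and returning a hypothesis.

```python
import itertools

def discretized_responses(tau):
    # The grid {-1, -1 + 2*tau, ..., 1 - 2*tau, 1} of allowed oracle answers.
    steps = round(1.0 / tau)
    return [-1.0 + 2.0 * tau * i for i in range(steps + 1)]

def candidate_set(sq_learner, num_queries, tau):
    # Feed the realizable SQ-learner every combination of discretized
    # responses and collect its outputs; some combination is a valid
    # tau-accurate transcript, so the resulting set C_A covers the class.
    return [sq_learner(list(resp))
            for resp in itertools.product(discretized_responses(tau),
                                          repeat=num_queries)]
```

The candidate set has size roughly (1/τ)^{num_queries}, which is exactly the source of the exponential query blowup discussed below.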
We note that while our reduction in this model experiences exponential blowup in the number of queries, this should really be thought of as a blowup in runtime rather than in "sample complexity" in the standard sense (which corresponds more closely to τ).

F Fairness
Recent years have seen rising interest in an algorithmic property called fairness. Informally, fairness tries to tackle the issue that "well-performing" classifiers in the standard sense may actually be discriminatory against certain individuals or subgroups. We will consider a form of fair learning introduced by Rothblum and Yona [63] called Probably Approximately Correct and Fair (PACF) learning. Their definition is based on a notion of fairness that ensures that similar individuals are treated similarly with respect to a fixed metric.
Definition F.1 (Metric Fairness). Let d : X × X → R_{≥0} be a similarity measure on X and D a distribution over X. A classifier h : X → Y_out is called (α, γ)-fair with respect to d and D if h acts similarly on most similar individuals:

Pr_{x,x′∼D}[|h(x) − h(x′)| > d(x, x′) + γ] ≤ α.

We note that the output space Y_out may differ from the label space Y in general learning problems.

22. While we generally think of τ as being at worst polynomial in ε, this is not strictly necessary for the model.
In fact, this definition only really makes sense when the output classifier h is allowed to be real-valued (as this allows for some flexibility in the |h(x) − h(x′)| term). As such, when considering settings such as binary classification where Y = {0, 1} is discrete, Rothblum and Yona's [63] initial formalization considers returning probabilistic classifiers with Y_out = [0, 1]. Here h(x) = y ∈ [0, 1] is taken to be the probability of the label being 1. The error of a probabilistic classifier h with respect to any distribution D over X × {0, 1} is then given by its expected ℓ_1 distance:

err_D(h) = E_{(x,y)∼D}[|h(x) − y|].

For simplicity, we'll focus in this section on this same regime extended to the distribution-family model.
In broad strokes, the goal of Fair PAC learning is to output a fair classifier satisfying standard PAC guarantees. Practically this requires a few modifications. First, since there may be no fair classifier satisfying these guarantees, we will only require our output to be as good as the best fair classifier. Second, we will actually allow some slack in the fairness parameters, which Rothblum and Yona [63] show is a practical way to ensure that fair learnability remains possible across a broad range of classes.
The key observation is that the definition of fairness depends only on the classifier h and the marginal distribution D_X. Let h_OPT be the hypothesis achieving the minimum error over H_{D_X,α,γ}. By the above observation, with probability 1 − δ the hypothesis set C(S_U) returned by LEARNINGTOCOVER contains an (α + ε_α, γ + ε_γ)-fair hypothesis h satisfying:

E_{x∼D_X}[|h(x) − h_OPT(x)|] ≤ ε/2.

Since ℓ_1 error is a metric (and therefore satisfies the triangle inequality), we can use our argument for c-pseudometric loss functions from Theorem 8.5 to argue that choosing the lowest empirical risk (α + ε_α, γ + ε_γ)-fair classifier in C(S_U) with respect to a sufficiently large labeled sample S_L gives the desired learner.
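The final selection step is simple enough to sketch directly: filter the cover's candidates by the relaxed fairness condition, then return the empirical ℓ_1 risk minimizer. The fairness check `is_fair` is a hypothetical callback (it depends only on h and the marginal D_X, so in practice it would be estimated from unlabeled data).

```python
def pick_fair_erm(candidates, labeled, is_fair, alpha, gamma, eps_a, eps_g):
    # Keep only the (alpha + eps_a, gamma + eps_g)-fair candidates; fairness
    # depends only on h and the marginal D_X, so it can be checked per
    # candidate without labels.
    fair = [h for h in candidates if is_fair(h, alpha + eps_a, gamma + eps_g)]

    def emp_risk(h):
        # Empirical l1 risk of a probabilistic classifier h : X -> [0, 1].
        return sum(abs(h(x) - y) for x, y in labeled) / len(labeled)

    return min(fair, key=emp_risk)
```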
With care, this result can be extended to a broader range of loss functions as well as to other finitely-satisfiable properties covered in this work.

G Notions of Coverability
In this section we discuss the connection between non-uniform covers and several previous notions of coverability used in various learning applications. For simplicity, we'll restrict our attention to covering with respect to standard classification distance; that is, given a distribution D and hypotheses h and h′ over some instance space X:

d_D(h, h′) = Pr_{x∼D}[h(x) ≠ h′(x)].

To start, let's recall the basic notion of an ε-cover specialized to this measure for simplicity.
Definition G.1 (ε-cover). Let X be an instance space, Y a label space, and let L_{X,Y} denote the set of all labelings c : X → Y. A set C ⊂ L_{X,Y} is said to form an ε-cover for (D, X, H) if for every hypothesis h ∈ H, there exists c ∈ C such that d_D(c, h) ≤ ε.
C is called proper if C ⊂ H.
Finite ε-covers are exceedingly useful in learning theory. As discussed in Section 5, a common strategy in the literature is to use unlabeled samples to construct an ε-cover with high probability [33,39,27,64]. This results in a distribution over potential covers we call a uniform (ε, δ)-cover.
Definition G.2 (Uniform (ε, δ)-cover). Let X be an instance space, Y a label space, and let L_{X,Y} denote the set of all labelings c : X → Y. A distribution D_C over the power set P(L_{X,Y}) is said to form a uniform (ε, δ)-cover for (D, X, H) if:

Pr_{C∼D_C}[C is an ε-cover for (D, X, H)] ≥ 1 − δ.
D C is called proper if its support lies entirely in H.
In this work, we introduce a weaker non-uniform variant of this notion where each h has an individual guarantee of being covered by the distribution, but it is not necessarily the case that a sample will cover all h ∈ H simultaneously.

Definition G.3 (Non-Uniform (ε, δ)-cover). Let X be an instance space, Y a label space, and let L_{X,Y} denote the set of all labelings c : X → Y. A distribution D_C over the power set P(L_{X,Y}) is said to form a non-uniform (ε, δ)-cover for (D, X, H) if for every fixed hypothesis h ∈ H:

Pr_{C∼D_C}[∃c ∈ C : d_D(c, h) ≤ ε] ≥ 1 − δ.

In the context of learning, we are usually interested not just in the existence of these covers, but in the more challenging problem of constructing them from a small number of unlabeled samples. In other words, given a class (D, X, H), we'd like to know how many unlabeled samples from an adversarially chosen distribution D ∈ D are necessary to build a uniform (or non-uniform) (ε, δ)-cover for (D, X, H). In Section 8.3, we saw that the ability to construct a non-uniform (ε, δ)-cover from O(log(1/δ)/ε) samples was crucial to give a semi-private learner with optimal public sample complexity. This improved over recent work of Alon, Bassily, and Moran (ABM) [27], who showed that it is possible to build a uniform (ε, δ)-cover in O((log(1/ε) + log(1/δ))/ε) samples. It is interesting to ask whether non-uniformity is really necessary here, or whether ABM's analysis is simply sub-optimal. We'll show that the former is true, at least in the proper distribution-family setting: the log(1/ε) gap between these models is necessary, and uniform covers cannot be used to build optimal semi-private learners.
Theorem G.4 (Separation of Uniform and Non-Uniform Covers). There exists an instance space X, hypothesis class H, and family of distributions D such that for any sufficiently small ε > 0, the following statements hold: equally likely to have been sampled from any of these distributions, the probability that A(S) is a proper ε-cover is at most:

Pr[A fails given |supp(S)| = j < k] ≥ ((n−j choose k−j) − (m choose k)) / (n−j choose k−j).
Taking n sufficiently larger than m and k, we can make this probability as close to 1 as desired for any 0 < j < k. Finally, since samples of this form occur with probability at least 1/2, the algorithm fails with probability at least 1/3 as desired. It is left to prove Claim G.5.
Proof of Claim G.5. Notice that for any distribution unif(T) ∈ D_{n,k}, any i ∈ T, and any j ≠ i, d_{unif(T)}(h_i, h_j) > 2ε. Let C be any proper ε-cover of H under distribution unif(T). Then, by the above argument, it must contain {h_i : i ∈ T}. Since |T| = k, C can be a proper ε-cover of H under at most (|C| choose k) distributions in D_{n,k}.
We now move to proving that a proper non-uniform (ε, δ)-cover can be built in only O(log(1/δ)/ε) samples. This follows from the fact that for any n ≥ k and distribution unif(T) ∈ D_{n,k}, each fixed i ∈ T appears in the random sample S with probability at least 1 − δ. Since each h_j for j ∉ T is covered by h_0, outputting {h_i : i ∈ S} ∪ {h_0} generates a proper non-uniform (ε, δ)-cover.
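The per-hypothesis coverage guarantee can be illustrated with a small sketch over indices. Here T plays the role of the support of unif(T) (with |T| = k, which scales like 1/ε in the construction), and index 0 stands in for the default hypothesis h_0; the sample size |T|·log(1/δ) is the usual coupon-collector-style bound for hitting one fixed element.

```python
import math
import random

def nonuniform_cover_indices(T, delta, rng):
    # Each i in T has mass 1/|T| under unif(T). Drawing about
    # |T| * log(1/delta) indices ensures any FIXED i in T is sampled with
    # probability at least 1 - delta. Index 0 stands in for the default
    # hypothesis h_0, which covers every h_j with j outside T.
    m = math.ceil(len(T) * math.log(1.0 / delta)) + 1
    hit = {rng.choice(T) for _ in range(m)}
    return hit | {0}
```

Note the guarantee is non-uniform: any fixed index is covered with probability 1 − δ, but a single sample need not cover all of T at once.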
The construction in Theorem G.4 can easily be modified to give a class with the same gap which is not privately learnable (say by embedding a single copy of a threshold over [0, 1]). Since any such class requires at least Ω(1/ε) public samples to semi-privately learn by Theorem 8.17, Theorem G.4 then provides a separation between using uniform and non-uniform covers in semi-private learning: the former provably requires an extra log factor, while the latter matches the lower bound exactly. Unfortunately, our proof of this result only holds in the proper setting, as Claim G.5 fails when improper hypotheses are allowed. We conjecture that this is not an inherent barrier: the separation should continue to hold in the improper case, albeit with some different analysis.
We have now seen a weak separation between uniform and non-uniform covers, but one might reasonably wonder whether a much stronger separation is possible. In particular, all previous constructions of uniform covers use uniform convergence, but there exist simple examples of learnable classes in the distribution-family model that fail this property: do such classes provide an example of objects which are non-uniformly coverable but not uniformly coverable? Surprisingly, the answer is no! It turns out that an algorithm for non-uniform covering can always be used to construct a uniform covering without too much overhead. Moreover, we'll see that the log(1/ε) gap is tight when (X, H) has finite VC dimension.
To prove this, it will actually be useful to make a brief aside and introduce another closely related notion of covering called fractional covers. These objects are essentially a form of non-uniform covering which output a single hypothesis instead of a set of them.
Definition G.6 (Fractional cover). Let X be an instance space, Y a label space, and let L_{X,Y} denote the set of all labelings c : X → Y. A distribution D_C over L_{X,Y} is said to form a fractional (ε, p)-cover for (D, X, H) if for any fixed h ∈ H, a sample from D_C covers h with probability p:

Pr_{c∼D_C}[d_D(c, h) ≤ ε] ≥ p.

Fractional covers are closely connected to non-uniform covers. In fact, one can easily move between the two by sampling or subsampling.
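One direction of this equivalence is immediate to sketch: boosting a fractional (ε, p)-cover to a non-uniform (ε, δ)-cover by repeated sampling. The names below are illustrative; `draw_one` is a hypothetical sampler for the fractional cover distribution D_C.

```python
import math

def fractional_to_nonuniform(draw_one, p, delta, rng):
    # A fractional (eps, p)-cover hits any fixed h with probability >= p per
    # draw, so m independent draws miss it with probability <= (1 - p)^m.
    # Choosing m with (1 - p)^m <= delta yields a non-uniform (eps, delta)-cover.
    assert 0 < p < 1 and 0 < delta < 1
    m = math.ceil(math.log(delta) / math.log(1.0 - p))
    return [draw_one(rng) for _ in range(m)]
```

The reverse direction (subsampling a single hypothesis from a non-uniform cover) trades δ for a coverage probability p in the analogous way.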