Boosting simple learners

Boosting is a celebrated machine learning approach which is based on the idea of combining weak and moderately inaccurate hypotheses into a strong and accurate one. We study boosting under the assumption that the weak hypotheses belong to a class of bounded capacity. This assumption is inspired by the common convention that weak hypotheses are “rules-of-thumb” from an “easy-to-learn class”. (Schapire and Freund ’12, Shalev-Shwartz and Ben-David ’14.) Formally, we assume the class of weak hypotheses has a bounded VC dimension. We focus on two main questions: (i) Oracle Complexity: How many weak hypotheses are needed in order to produce an accurate hypothesis? We design a novel boosting algorithm and demonstrate that it circumvents a classical lower bound by Freund and Schapire (’95, ’12). Whereas the lower bound shows that Ω(1/γ²) weak hypotheses with γ-margin are sometimes necessary, our new method requires only Õ(1/γ) weak hypotheses, provided that they belong to a class of bounded VC dimension. Unlike previous boosting algorithms, which aggregate the weak hypotheses by majority votes, the new boosting algorithm uses more complex (“deeper”) aggregation rules. We complement this result by showing that complex aggregation rules are in fact necessary to circumvent the aforementioned lower bound. (ii) Expressivity: Which tasks can be learned by boosting weak hypotheses from a bounded VC class? Can complex concepts that are “far away” from the class be learned? Towards answering the first question we identify a combinatorial-geometric parameter which captures the expressivity of base-classes in boosting. As a corollary we provide an affirmative answer to the second question for many well-studied classes, including half-spaces and decision stumps. Along the way, we establish and exploit connections with Discrepancy Theory.


Introduction
Boosting is a fundamental and powerful framework in machine learning which concerns methods for learning complex tasks using combinations of weak learning rules. It offers a convenient reduction approach, whereby in order to learn a given classification task, it suffices to find moderately inaccurate learning rules (called "weak hypotheses"), which are then automatically aggregated by the boosting algorithm into an arbitrarily accurate one. The weak hypotheses are often thought of as simple prediction-rules: "Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb." [32, Chapter 1] ". . . an hypothesis that comes from an easy-to-learn hypothesis class and performs just slightly better than a random guess." [33, Chapter 10: Boosting] In this work we explore how the simplicity of the weak hypotheses affects the complexity of the overall boosting algorithm: let B denote the base-class which consists of the weak hypotheses used in the boosting procedure. For example, B may consist of all 1-dimensional threshold functions.¹ Can one learn arbitrarily complex concepts c : ℝ → {±1} by aggregating thresholds in a boosting procedure? Can one do so by simple aggregation rules such as weighted majority? How many thresholds must one aggregate to successfully learn a given target concept c? How does this number scale with the complexity of c?

Target-Class Oriented Boosting (traditional perspective).
It is instructive to compare the above view of boosting with the traditional perspective. The pioneering manuscripts on this topic (e.g. [20, 31, 12]) explored the question of boosting a weak learner in the Probably Approximately Correct (PAC) setting [34]: let H ⊆ {±1}^X be a concept class; a γ-weak learner for H is an algorithm W which satisfies the following weak learning guarantee: let c ∈ H be an arbitrary target concept and let D be an arbitrary target distribution on X. (It is important to note that it is assumed here that the target concept c is in H.) The input to W is a confidence parameter δ > 0 and a sample S of m₀ = m₀(δ) examples (x, c(x)), where the x's are drawn independently from D. The weak learning guarantee asserts that the hypothesis h = W(S) outputted by W satisfies E_{x∼D}[h(x) · c(x)] ≥ γ, with probability at least 1 − δ. That is, W is able to provide a non-trivial (but far from desired) approximation of any target concept c ∈ H. The goal of boosting is to efficiently² convert W into a strong PAC learner which can approximate c arbitrarily well. That is, an algorithm whose input consists of error and confidence parameters ε, δ > 0 and a polynomial number m(ε, δ) of examples, and whose output is a hypothesis h′ such that E_{x∼D}[h′(x) · c(x)] ≥ 1 − ε, with probability at least 1 − δ. For a text-book introduction see, e.g., [32, 33].
¹ I.e., hypotheses h : ℝ → {±1} with at most one sign-change.

Base-Class Oriented Boosting (this work).
In this manuscript, we study boosting under the assumption that one first specifies a fixed base-class B of weak hypotheses, and the goal is to aggregate hypotheses from B to learn target-concepts c that may be far away from B. (Unlike the traditional view of boosting discussed above.) In practice, the choice of B may be guided by prior information on the relevant learning task.
Fix a base-class B. Which target concepts c can be learned? How "far away" from B can c be? To address this question we revisit the standard weak learning assumption which, in this context, can be rephrased as follows: the target concept c satisfies that for every distribution D over X there exists h ∈ B such that E_{x∼D}[h(x) · c(x)] ≥ γ.
(Notice that the weak learning assumption poses a restriction on the target concept c by requiring it to exhibit correlation ≥ γ with B with respect to arbitrary distributions.) The weak learner W is given an i.i.d. sample of m₀(δ) random c-labelled examples drawn from D, and is guaranteed to output a hypothesis h ∈ B which satisfies the above with probability at least 1 − δ. In contrast with the traditional "Target-Class Oriented Boosting" perspective discussed above, the weak learning algorithm here is a strong learner for the base-class B, in the sense that whenever there exists h ∈ B which is γ-correlated with a target-concept c with respect to a target-distribution D, then W is guaranteed to find such an h. The weakness of W is manifested via the simplicity of the hypotheses in B.
² Note that from a sample-complexity perspective, the task of boosting can be analyzed by basic VC theory: by the existence of a weak learner whose sample complexity is m₀, it follows that the VC dimension of H is O(m₀(δ)) for δ = 1/2. Then, by the Fundamental Theorem of PAC Learning, the sample complexity of (strongly) PAC learning H is O((d + log(1/δ))/ε).
This perspective of boosting is common in real-world applications. For example, the well-studied Viola-Jones object detection framework uses simple rectangular-based prediction rules as weak hypotheses for the task of object detection [35].
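To make the oracle model concrete, the following minimal sketch (ours, not the paper's; the names `Hypothesis`, `corr`, and `WeakLearner` are illustrative) fixes the weak-learner interface assumed throughout:

```python
from typing import Callable, Sequence

# A hypothesis maps a domain point to a label in {-1, +1}.
Hypothesis = Callable[[float], int]

def corr(h: Hypothesis, xs: Sequence[float], ys: Sequence[int],
         mu: Sequence[float]) -> float:
    """Correlation E_{(x,y)~mu}[h(x) * y] of h with the labels under mu."""
    return sum(m * h(x) * y for x, y, m in zip(xs, ys, mu))

# The weak-learning contract: given a sample drawn from a gamma-realizable
# distribution, return some h in the base-class B whose correlation with
# the labels is at least gamma (with high probability over the sample).
WeakLearner = Callable[[Sequence[float], Sequence[int]], Hypothesis]
```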
Main Questions. We are interested in the interplay between the simplicity of the base-class B and the expressiveness and efficiency of the boosting algorithm. The following aspects will be our main focus: 1. Expressiveness: Given a small edge parameter γ > 0, how rich is the class of tasks that can be learned by boosting weak hypotheses from B? At what "rate" does this class grow as γ → 0? How about when B is a well-studied class such as decision stumps or halfspaces?
2. Oracle Complexity: How many times must the boosting algorithm apply a weak learner to learn a task which is γ-correlated with B? Can one improve upon the Õ(1/γ²) bound which is exhibited by classical algorithms such as AdaBoost? Note that each call to the weak learner W amounts to solving an optimization problem w.r.t. B. Thus, saving upon this resource can significantly improve the overall running time of the algorithm.
The base-class oriented perspective has been considered by previous works such as [6, 13, 24, 14, 4, 21, 3, 28]. These works design specific learning algorithms that are based on aggregating hypotheses from the base-class. In particular these works remove the weak learner, in the sense that the weak hypothesis which is obtained in each round is computed explicitly by optimizing an appropriate function on the data (e.g., maximizing the "margin" [32] or the "edge" [6]). In other words, instead of having an oracle access to an arbitrary learner which is only assumed to satisfy the weak learning assumption, these works use carefully tailored ways of picking the next weak hypothesis from the base class B. Consequently, the notion of oracle-complexity (which is a central resource in our framework) is irrelevant in these works. Furthermore, these works focus only on the standard aggregation rule by weighted majority, whereas the results in this manuscript exploit the possibility of using more complex rules and explore their expressiveness.
Outline. We begin with presenting the main definitions and results in Section 2: in Section 2.1 we present a new boosting method whose oracle complexity is only Õ(1/γ) weak hypotheses, provided that they belong to a class of bounded VC dimension. We also analyze its generalization performance. In Section 2.2 we study limits on the expressivity of base-classes; that is, we address the question of which distributions can be learned by boosting an agnostic learner for a given base-class B. Towards this end we identify two combinatorial-geometric dimensions, called the γ-VC dimension and the γ-interpolation dimension, which provide quantitative bounds on the expressivity.
In Section 3 we overview the main technical ideas used in our proofs, and finally Section 4 and Section 5 contain the proofs: In Section 4 we prove the results regarding oracle-complexity, and in Section 5 the results regarding expressivity. Each of Section 4 and Section 5 can be read independently after Section 2 with one exception: the oracle-complexity lower bound in Section 4 relies on the theory developed in Section 5. Finally, Section 6 contains some suggestions for future research.

Main Results
In this section we provide an overview of the main results in this manuscript.
Weak Learnability. Our starting point is a reformulation of the weak learnability assumption in a way which is more suitable to our setting. Recall that the γ-weak learnability assumption asserts that if c : X → {±1} is the target concept then, if the weak learner is given enough c-labeled examples drawn from any input distribution over X, it will return a hypothesis which is γ-correlated with c. Since here it is assumed that the weak learner is a strong learner for the base-class B, one can rephrase the weak learnability assumption only in terms of B using the following notion³: a sample S = ((x₁, y₁), . . . , (xₘ, yₘ)) is γ-realizable with respect to B if for every distribution μ over S there exists h ∈ B such that E_{(x,y)∼μ}[y · h(x)] ≥ γ; a distribution D is γ-realizable if, with probability 1, a sample drawn i.i.d. from D is γ-realizable.⁴ Thus, the γ-weak learnability assumption boils down to assuming that the target distribution is γ-realizable.
Note that for γ = 1 the notion of γ-realizability specializes to the classical notion of realizability (i.e., consistency with the class). Also note that as γ → 0, the set of γ-realizable samples becomes larger.
³ In fact, γ-realizability corresponds to the empirical weak learning assumption by [32, Chapter 2.3.2]. The latter is a weakening of the standard weak PAC learning assumption which suffices to guarantee generalization.
⁴ We note that one can relax the definition of a γ-realizable distribution by requiring that a random sample from it is γ-realizable w.h.p. (rather than w.p. 1). Consequently, the results in this paper which use this definition also hold w.h.p. However, for the sake of exposition we work with the above definition.
Quantifying Simplicity. Inspired by the common intuition that weak hypotheses are "rules-of-thumb" [32] that belong to an "easy-to-learn hypothesis class" [33], we make the following assumption:
ASSUMPTION 2.2 (Simplicity of Weak Hypotheses). Let B ⊆ {±1}^X denote the base-class which contains the weak hypotheses provided by the weak learner. Then, B is a VC class; that is, VC(B) = O(1).

Upper Bound (Section 4.1)
Can the assumption that B is a VC class be utilized to improve upon existing boosting algorithms?
We provide an affirmative answer by using it to circumvent a classical lower bound on the oracle complexity of boosting, which shows that Ω(1/γ²) calls to the weak learner are sometimes necessary. However, the "bad" weak learner W witnessing this lower bound is constructed using a probabilistic argument; in particular the VC dimension of the corresponding base-class of weak hypotheses is ω(1). Thus, this result leaves open the possibility of achieving an o(1/γ²) oracle-complexity, under the assumption that the base-class B is a VC class.
We demonstrate a boosting procedure called Graph Separation Boosting (Algorithm 1) which, under the assumption that B is a VC class, invokes the weak learner only Õ(log(1/ε)/γ) times and achieves generalization error ≤ ε. We stress that Algorithm 1 is oblivious to the advantage parameter γ and to the class B. (I.e., it does not "know" B nor γ.) The assumption that B is a VC class is only used in the analysis.
It will be convenient in this part to weaken the weak learnability assumption as follows: for any γ-realizable distribution D, if W is fed with a sample S′ ∼ D^{m₀} then E_{S′∼D^{m₀}}[corr_D(W(S′))] ≥ γ/2. That is, we only require that the expected correlation of the output hypothesis is at least γ/2 (rather than γ with high probability).
The main idea guiding the algorithm is quite simple. We wish to collect as quickly as possible a set of weak hypotheses h₁, . . . , h_T ∈ B that can be aggregated into a consistent hypothesis. The input to Algorithm 1 is a sample S = ((x₁, y₁), . . . , (xₙ, yₙ)) which is γ-realizable by B, and a black-box oracle access to the weak learner W. The algorithm maintains the opposite-labelled pairs (xᵢ, xⱼ) (i.e., with yᵢ ≠ yⱼ) that are not yet separated by any of the obtained weak hypotheses, and in each round feeds W with a distribution supported on such unseparated pairs. Once every opposite-labelled pair is separated, the map x ↦ (h₁(x), . . . , h_T(x)) distinguishes positive from negative examples, and hence there exists an aggregation rule G with G(h₁(xᵢ), . . . , h_T(xᵢ)) = yᵢ for all i. The following theorem shows that the (expected) number of calls to the weak learner until all pairs are separated is some T = O(log(|S|)/γ); a code sketch of the procedure appears after the theorem. The theorem is stated in terms of the number of rounds, but as the weak learner is called once per round, the number of rounds equals the oracle-complexity. An important subtlety in Algorithm 1 is that it does not specify how to find the aggregation rule G in Line 10. In this sense, Algorithm 1 is in fact a meta-algorithm.

THEOREM 2.3 (Oracle Complexity Upper Bound). Let S be an input sample of size n which is γ-realizable with respect to B, and let T denote the number of rounds Algorithm 1 performs until all opposite-labelled pairs in S are separated. Then, E[T] = O(log(n)/γ).
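The following Python sketch illustrates the graph-separation strategy (our illustrative reconstruction under the expected-correlation assumption above, not the paper's pseudocode; `weak_learner` and `m0` match the interface sketched in the introduction):

```python
import random
from itertools import combinations

def graph_separation_boost(xs, ys, weak_learner, m0, seed=0):
    """Collect weak hypotheses until every opposite-labelled pair of the
    sample is separated; sketch of Algorithm 1 (aggregation step omitted)."""
    rng = random.Random(seed)
    hs = []  # weak hypotheses obtained so far
    # Edges of the separation graph: opposite-labelled, not yet separated.
    edges = [(i, j) for i, j in combinations(range(len(xs)), 2)
             if ys[i] != ys[j]]
    while edges:
        # Round distribution: a random unseparated edge, then a random
        # endpoint of it (keeping the original label).
        idx = [rng.choice(rng.choice(edges)) for _ in range(m0)]
        h = weak_learner([xs[i] for i in idx], [ys[i] for i in idx])
        hs.append(h)
        # An edge is separated once some hypothesis splits its endpoints.
        edges = [(i, j) for (i, j) in edges if h(xs[i]) == h(xs[j])]
    return hs  # x -> (h_1(x), ..., h_T(x)) now separates the classes
```

The intended invariant is that the round distribution turns the weak learner's γ/2 expected correlation into an expected γ/2-fraction of newly separated edges, which is what drives the O(log(n)/γ) bound of Theorem 2.3.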
It is possible that for different classes B one can implement Line 10 in different ways which depend on the structure of B and yield favorable rules G.⁵ In practice, one might also consider applying heuristics to find G: e.g., consider the T = O(log n / γ)-dimensional representation x ↦ (h₁(x), . . . , h_T(x)) which is implied by the weak hypotheses, and train a neural network to find an interpolating rule G.⁶ (Recall that such a G is guaranteed to exist, since h₁, . . . , h_T separate all opposite-labelled pairs.) To accommodate the flexibility in computing the aggregation rule in Line 10, we provide a generalization bound which adapts to the complexity of the aggregation rule. That is, a bound which yields better generalization guarantees for simpler rules. Formally, given the weak hypotheses h₁, . . . , h_T, we consider the aggregation class H = H(h₁ . . . h_T) of all hypotheses of the form x ↦ G(h₁(x), . . . , h_T(x)); e.g., classical boosting algorithms output weighted majorities sign(Σ_t w_t h_t(x)) with w_t ∈ ℝ, and the particular weighted majority in H which is outputted depends on the input sample S.
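As one concrete instance of this heuristic (our illustration; `fit_aggregator` is a hypothetical helper, and any interpolating learner could replace the network):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_aggregator(hs, xs, ys):
    """Heuristically fit an aggregation rule G on the representation
    x -> (h_1(x), ..., h_T(x)). Since the h's separate all opposite-
    labelled pairs, an interpolating G is guaranteed to exist."""
    reps = np.array([[h(x) for h in hs] for x in xs])  # n x T matrix of +-1
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=5000).fit(reps, ys)
    return lambda x: int(net.predict([[h(x) for h in hs]])[0])
```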

THEOREM 2.4 (Aggregation-Dependent Bounds).
Assume that the input sample to Algorithm 1 is drawn from a distribution which is γ-realizable with respect to B. Let h₁ . . . h_T denote the hypotheses outputted by W during the execution of Algorithm 1 on S, and let H = H(h₁ . . . h_T) denote the aggregation class. Then, the following occurs with probability at least 1 − δ:
⁵ For example, when B is the class of one dimensional thresholds, see Section 4.1.

⁶ Observe in this context that the common weighted-majority-vote aggregation rule can be viewed as a single neuron with a threshold activation function.
where m₀ is the sample complexity of the weak learner W. Note that the aggregation class H = H(h₁ . . . h_T) is data-dependent: it is a function of the input sample of the algorithm. Thus, the generalization bound above does not follow from standard VC generalization bounds, which apply to fixed (data-independent) classes. The way we control this data dependency is via the notion of hybrid sample compression schemes [32]; recall that in standard sample compression schemes, the output hypothesis is determined by a small subsample of the input sample. Proposition 2.5 generalizes a result by [5] who considered the case when G = {g} consists of a single function. (See also [11, 10].) In Section 4 we state and prove Proposition 4.9 which gives an even more general bound which allows the h_t's to belong to different classes B_t.
Note that even if Algorithm 1 uses arbitrary aggregation rules, Proposition 2.5 still provides a bound of VC(H(h₁ . . . h_T)) ≤ O(T/d*)^{d*}, where d* is the dual VC dimension of B. In particular, since B has VC dimension d = O(1), its dual VC dimension also satisfies d* = O(1), and we get a polynomial bound on the complexity of H. This shows that indeed the impossibility result by [32] is circumvented when B is a VC class: in this case the sample size is bounded by a polynomial function of 1/ε, 1/δ.
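For intuition, here is the standard cell-counting calculation behind such a bound (a sketch under the stated assumptions; d* denotes the dual VC dimension of B). Any rule G(h₁, . . . , h_T) is constant on each cell of the partition induced by the pattern map x ↦ (h₁(x), . . . , h_T(x)), and by the Sauer-Shelah-Perles Lemma applied to the dual class,

#cells = |{(h₁(x), . . . , h_T(x)) : x ∈ X}| ≤ O(T/d*)^{d*}.

Since every aggregation is determined by its values on the cells, |H(h₁ . . . h_T)| ≤ 2^{#cells}, and hence VC(H(h₁ . . . h_T)) ≤ log₂|H(h₁ . . . h_T)| ≤ O(T/d*)^{d*}.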
Note however that the obtained generalization bound is quite pessimistic (exponential in d*) and thus we consider this polynomial bound interesting only from a purely theoretical perspective: it serves as a proof of concept that improved guarantees are provably possible when the base-class B is simple. We stress again that for specific classes B one can come up with explicit and simple aggregation rules and hence obtain better generalization bounds via Theorem 2.4. We refer the reader to Section 4 for a more detailed discussion and the proofs.

Oracle Complexity Lower Bound (Section 4.2)
Given that virtually all known boosting algorithms use majority-votes to aggregate the weak hypotheses, it is natural to ask whether the Õ(1/γ) oracle-complexity upper bound can be attained if one restricts to aggregation by such rules. We prove an impossibility result, which shows that a nearly quadratic lower bound holds when B is the class of halfspaces in ℝ^d.

⁷ In more detail, d* ≤ 2^{d+1} − 1, and for many well-studied classes (such as halfspaces) the VC dimension and its dual are polynomially related [2].
Our proof of Theorem 2.7 is based on a counting argument which applies more generally; it can be used to provide similar lower bounds as long as the family of allowed aggregation rules is sufficiently restricted (e.g., aggregation rules that can be represented by a bounded circuit of majority-votes, etc.).

Expressivity (Section 5)
We next turn to study the expressivity of VC classes as base-classes in the context of boosting.
That is, given a class B, what can be learned using oracle access to a learning algorithm W for B?
It will be convenient to assume that B ⊆ {±1}^X is symmetric, i.e., that h ∈ B implies −h ∈ B. This assumption does not compromise generality because a learning algorithm for B can be converted to a learning algorithm for {±h : h ∈ B} with a similar sample complexity. So, if B is not symmetric, we can replace it by {±h : h ∈ B}.
Our starting point is the following proposition, which asserts that under a mild condition, any base-class B can be used via boosting to learn arbitrarily complex tasks as γ → 0.

PROPOSITION 2.8 (A Condition for Universality).
Item 1 implies that in the limit as γ → 0, any sample can be interpolated by aggregating weak hypotheses from B in a boosting procedure. Indeed, it asserts that any such sample satisfies the weak learning assumption for some γ > 0, and therefore, given oracle access to a sufficiently accurate learning algorithm for B, any boosting algorithm will successfully interpolate it.
Observe that every class B that contains singletons or one-dimensional thresholds satisfies Item 2 and hence also Item 1. Thus, virtually all standard hypothesis classes that are considered in the literature satisfy it.
It is worth mentioning here that an "infinite" version of Proposition 2.8 has been established for some specific boosting algorithms. Namely, these algorithms have been shown to be universally consistent in the sense that their excess risk w.r.t. the Bayes optimal classifier tends to zero in the limit, as the number of examples tends to infinity. See e.g. [3].

Measuring Expressivity of Base-Classes
Proposition 2.8 implies that, from a qualitative perspective, any reasonable class can be boosted to approximate arbitrarily complex concepts, provided that γ is sufficiently small. From a realistic perspective, it is natural to ask how small γ should be in order to ensure a satisfactory level of expressivity.

QUESTION 2.9.
Given a fixed small γ > 0, what are the tasks that can be learned by boosting a γ-learner for B? At which rate does this class of tasks grow as γ → 0?
To address this question we propose two combinatorial parameters, called the γ-VC dimension and the γ-interpolation dimension, which quantify the size/richness of the family of tasks that can be learned by aggregating hypotheses from B.

DEFINITION 2.10 (γ-interpolation).
Let B be a class and γ ∈ [0, 1] be an edge parameter. We say that a set A ⊆ X is γ-interpolated by B if every labeling of A is γ-realizable with respect to B. Intuitively, when picking a base-class B, one should minimize the VC dimension (because then the weak-learning task is easier, and hence each call to the weak learner is less expensive), while maximizing the family of γ-interpolated sets (because then the overall boosting algorithm can learn more complex tasks). This gives rise to the following definition, which has been introduced by Chen, Minasyan, Lee, and Hazan [9].

DEFINITION 2.11 (γ-interpolation dimension).
Let B be a class and γ ∈ [0, 1] be an edge parameter. The γ-interpolation dimension of B, denoted ID_γ(B), is the maximal integer n ≥ 0 for which every subset of X of size n is γ-interpolated. If B γ-interpolates every finite subset of X then its γ-interpolation dimension is defined to be ∞.
We note that this definition might be too restrictive in natural scenarios where it is impossible to γ-interpolate certain small degenerate sets. For example, consider a learning task where X = ℝ^d and B is some geometrically defined class. In such cases, it might be more natural to quantify only over sets that are in general position. Indeed, our results below regarding the expressiveness of half-spaces and decision-stumps are based on such relevant assumptions.
The following definition extends the classical VC dimension:
DEFINITION (γ-VC dimension). The γ-VC dimension of B, denoted VC_γ(B), is the maximal size of a set A ⊆ X which is γ-interpolated by B, i.e., such that every labeling of A is γ-realizable.
THEOREM 2.14 (γ-VC dimension: general bounds). Let B be a class with VC dimension d. Then, VC_γ(B) = Õ(d/γ²). Moreover, this bound is nearly tight as long as d is not very small compared to log(1/γ): for every γ > 0 and d ∈ ℕ there is a class B of VC dimension O(d log(1/γ)) with VC_γ(B) = Ω̃(d/γ²). Thus, the fastest possible growth of the γ-VC dimension is asymptotically ≈ d/γ². We stress that the upper bound here implies an impossibility result; it poses a restriction on the class of tasks that can be approximated by boosting a γ-learner for B.
Note that the above lower bound is realized by a class B whose VC dimension is at least Ω(log(1/γ)), which deviates from our focus on the setting where the VC dimension is a constant and γ → 0. Thus, we prove the next theorem which provides a sharp, subquadratic, dependence on γ (but a looser dependence on d).

THEOREM 2.15 (γ-VC dimension: improved bound for small γ).
Let B be a class with VC dimension d ≥ 1. Then, for every 0 < γ ≤ 1: VC_γ(B) = O((1/γ)^{2 − 2/(d+1)}), where O(·) conceals a multiplicative constant that depends only on d. Moreover, the above inequality applies for any class B whose primal shatter function⁸ is at most d.
As we will prove in Theorem 2.16, the dependence on γ in the above bound is tight. It will be interesting to determine tighter bounds in terms of d.

Bounds for Popular Base-Classes.
We next turn to explore the γ-VC and γ-interpolation dimensions of two well-studied geometric classes: halfspaces and decision stumps.
Let HS_d denote the class of halfspaces (also known as linear classifiers) in ℝ^d. That is, HS_d contains all concepts of the form x ↦ sign(w · x + b), where w ∈ ℝ^d, b ∈ ℝ, and w · x denotes the standard inner product between w and x. This class is arguably the most well-studied class in machine learning; Theorem 2.16 below determines its γ-VC dimension tightly in terms of γ and shows that every sufficiently dense set is γ-interpolated by it. Thus the class of halfspaces is rather expressive as a base-class; note that natural point sets such as grids are dense and hence meet the condition for being γ-interpolated by halfspaces.
We next study the γ-VC and γ-interpolation dimensions of the class of decision stumps. A d-dimensional decision stump is a concept of the form sign(s(x_i − t)), where i ≤ d, s ∈ {±1} and t ∈ ℝ. In other words, a decision stump is a halfspace which is aligned with one of the principal axes.
⁸ The primal shatter function of a class B ⊆ {±1}^X is the minimum d for which there exists a constant C such that for every finite A ⊆ X, the size of B|_A = {h|_A : h ∈ B} is at most C · |A|^d. Note that by the Sauer-Shelah-Perles Lemma, the primal shatter function is at most the VC dimension.
This class is popular in the context of boosting, partially because it is easy to learn, even in the agnostic setting. Also note that the Viola-Jones framework hinges on a variant of decision stumps [35].
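For concreteness, here is a brute-force agnostic learner for one-dimensional stumps (our illustrative sketch, not code from the paper): it returns the largest weighted correlation achievable by any stump. On n alternating labels with uniform weights (n odd) the best correlation is exactly 1/n, which matches the linear 1/γ behaviour of stumps discussed in Theorem 2.17 below.

```python
def best_stump_corr(xs, ys, mu):
    """Maximize corr(h) = sum_i mu[i] * ys[i] * h(xs[i]) over 1-D stumps
    h(x) = s * sign(x - t), s in {+1, -1}, by brute force over thresholds."""
    pts = sorted(set(xs))
    # Candidate thresholds: below all points, between consecutive points,
    # and above all points.
    ts = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] \
         + [pts[-1] + 1.0]
    return max(
        sum(m * y * (s if x > t else -s) for x, y, m in zip(xs, ys, mu))
        for t in ts for s in (+1, -1))

n = 21                                   # odd number of points
xs = list(range(n))
ys = [(-1) ** i for i in range(n)]       # alternating labels +1, -1, ...
mu = [1.0 / n] * n                       # uniform weights
print(best_stump_corr(xs, ys, mu))       # prints 1/21 ~ 0.0476
```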

THEOREM 2.17 (Decision Stumps).
Let DS_d denote the class of decision stumps in ℝ^d and γ ∈ (0, 1]. Then, VC_γ(DS_d) = O(d/γ). Moreover, the dependence on γ is tight, already in the 1-dimensional case. In fact, for every γ such that 1/γ ∈ ℕ, the class DS_d γ-interpolates every set A of size 1/γ, provided that there exists i ≤ d so that every pair of distinct points x, x′ ∈ A satisfies x_i ≠ x′_i. Thus, the class of halfspaces exhibits a near quadratic dependence on 1/γ (which, by Theorem 2.15, is the best possible), and the class of decision stumps exhibits a linear dependence on 1/γ. In this sense, the class of halfspaces is considerably more expressive. On the other hand the class of decision stumps can be learned more efficiently in the agnostic setting, and hence the weak learning task is easier with decision stumps.
Along the way of deriving the above bounds, we analyze the γ-VC dimension of one-dimensional classes and of unions of one-dimensional classes. From a technical perspective, we exploit some fundamental results in discrepancy theory.

Technical Overview
In this section we overview the main ideas which are used in the proofs. We also try to guide the reader on which of our proofs reduce to known arguments and which require new ideas.

Lower Bound
We begin with overviewing the proof of Theorem 2.7, which asserts that any boosting algorithm which uses a (possibly weighted) majority vote as an aggregation rule must call the weak learner nearly Ω(1/γ²) times, even if the base-class has a constant VC dimension.
It may be interesting to note that from a technical perspective, this proof bridges the two parts of the paper. In particular, it relies heavily on Theorem 2.16 which bounds the -VC dimension of halfspaces.
The idea is as follows: let T = T(γ) denote the minimum number of times a boosting algorithm calls a γ-learner for halfspaces in order to achieve a constant population loss, say ε = 1/4. We show that unless T is sufficiently large (nearly quadratic in 1/γ), there must exist a γ-realizable learning task (i.e., one which satisfies the weak learning assumption) that cannot be learned by the boosting algorithm. We make two more comments about this proof which may be of interest.
First, we note that the set used in the proof is a regular grid⁹ (this set is implied by Theorem 2.16). Therefore, the hard learning tasks which require a large oracle complexity are natural: the target distribution is uniform over a regular grid.
The second comment concerns our upper bound on the aggregation class H. Our argument here can be used to generalize a classical result by [5] regarding the composition of VC classes, which bounds the VC dimension of the class obtained by composing a fixed aggregation rule with hypotheses from a VC class.

Upper Bound
Algorithm 1. We next try to provide intuition for Algorithm 1 and discuss some technical aspects of its analysis. The main idea behind the algorithm boils down to a simple observation: once the obtained weak hypotheses separate every opposite-labelled pair in the sample, some aggregation of them is consistent with the sample. Thus, Algorithm 1 attempts to obtain as fast as possible weak hypotheses h₁, . . . , h_T that separate all opposite-labelled pairs.
Generalization Guarantees. As noted earlier, Algorithm 1 is a meta-algorithm in the sense that it does not specify how to find the aggregation rule in Line 10. In particular, this part of the algorithm may be implemented differently for different base-classes. We therefore provide generalization guarantees which adapt to the way this part is implemented. The key quantity is the number of cells in the partition of the domain induced by h₁, . . . , h_T. Now, since B is a VC class, one can show that the number of cells is at most O(T^{d*}), where d* is the dual VC dimension of B. This enables a description of any aggregation G(h₁ . . . h_T) using O(T^{d*}) bits.¹⁰ The complete analysis of this part appears in Proposition 2.5 and Corollary 2.6.
⁹ Let us remark in passing that the grid can be chosen more generally; the important property the set needs to satisfy is that the ratio between the largest and smallest distance among pairs of distinct points in it is not too large (as is the case for a regular grid).
As discussed earlier, we consider the above bound of purely theoretical interest, as it assumes that the aggregation rule is completely arbitrary. We expect that for specific and structured base-classes B which arise in realistic scenarios, one could find consistent aggregation rules more systematically and get better generalization guarantees using Theorem 2.4.

Expressivity
We next overview some of the main ideas which are used to analyze the notion of γ-realizability and the γ-VC and γ-interpolation dimensions.
A Geometric Point of View. We start with a simple yet useful observation regarding the notion of γ-realizability: recall that a sample S = ((x₁, y₁) . . . (xₙ, yₙ)) is γ-realizable with respect to B if every distribution over it admits a hypothesis in B with correlation at least γ; by a minimax (duality) argument, this is equivalent to the existence of a mixture of hypotheses from B which is γ-correlated with the labels pointwise (see Lemma 5.1). For the uniform distribution over the sample, such correlations are governed by the discrepancy of the class. To conclude, results in discrepancy theory are directly related to γ-realizability when the distribution over the sample is uniform. However, arbitrary distributions require special care. In some cases, it is possible to modify arguments from discrepancy theory to apply to non-uniform distributions. One such example is our analysis of the γ-VC dimension of halfspaces in Theorem 2.16, which is an adaptation of (the proof of) a seminal result in discrepancy theory due to [1]. Other cases, such as the analysis of the γ-VC dimension of decision stumps, require a different approach. We discuss this in more detail in the next paragraph.
¹⁰ Note that d* = O(1) since d* < 2^{d+1}, where d = VC(B) = O(1), and therefore the number of bits is polynomial in T [2]. We remark also that many natural classes, such as halfspaces, satisfy d* ≈ d.
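To spell out the connection for the uniform distribution (our rendering of the standard calculation): let u be uniform over a finite set A, let χ : A → {±1} be a labeling, let 𝒮 = {supp(h) : h ∈ B} where supp(h) = {x ∈ A : h(x) = 1}, and write disc(χ; 𝒮) = max_{s∈𝒮} |Σ_{x∈s} χ(x)| and disc(𝒮) = min_χ disc(χ; 𝒮). Then, since B is symmetric,

E_{x∼u}[χ(x) · h(x)] = (1/|A|) (Σ_{x: h(x)=1} χ(x) − Σ_{x: h(x)=−1} χ(x)), so |E_{x∼u}[χ(x) · h(x)]| ≤ 2 · disc(χ; 𝒮)/|A|.

In particular, if every labeling of A is γ-realizable then γ ≤ 2 · disc(𝒮)/|A|: a balanced coloring of A certifies an upper bound on the γ-VC dimension.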
Linear Programming. The upper bound of Theorem 2.17 is proved via a linear-programming argument. In a nutshell, the idea is to consider a small finite set of decision stumps G ⊆ DS_d of size |G| ≤ n/2 with the property that for every decision stump s ∈ DS_d there is a representative g ∈ G such that the number of xᵢ's where s(xᵢ) ≠ g(xᵢ) is sufficiently small.
That is, s and g agree on all but a small fraction of the xᵢ's. The existence of such a set follows by Haussler's Packing Lemma [18]. Now, since |G| ≤ n/2, we can find many pairs (ȳ, μ̄) such that the constraints in Equation (2) hold. This follows by a simple linear-algebraic consideration (the intuition here is that there are only n/2 constraints in Equation (2) but n degrees of freedom). We proceed by using a linear program to define a polytope which encodes the set of all pairs (ȳ, μ̄) which satisfy Equation (2), and arguing that a vertex of this polytope corresponds to a pair (ȳ, μ̄) which satisfies Equation (1), as required.
The above argument applies more generally for classes which can be represented as a small union of 1-dimensional classes (see Proposition 5.8).

Oracle-Complexity
In this section we state and derive the oracle-complexity upper and lower bounds. We begin with the upper bound in Section 4.1, where we analyze Algorithm 1, and then derive the lower bound in Section 4.2, where we also prove a combinatorial result about composition of VC classes which may be of independent interest.

Oracle Complexity Upper Bound
Our results on the expressivity of boosting advocate choosing a simple base-class B and using it via boosting to learn concepts which may be far away from B by adjusting the advantage parameter γ. We have seen that the overall boosting algorithm becomes more expressive as γ becomes smaller. On the other hand, reducing γ also increases the difficulty of weak learning: indeed, detecting a γ-correlated hypothesis in B amounts to solving an empirical risk minimization problem over a sample of O(VC(B)/γ²) examples. It is therefore desirable to minimize the number of times the weak learner is applied in the boosting procedure.
Improved Oracle Complexity Bound. The optimal oracle complexity was studied before in [32, Chapter 13], where it was shown that there exists a weak learner W such that the population loss of any boosting algorithm after T interactions with W is at least exp(−O(γ²T)); equivalently, Ω(1/γ²) calls are needed to achieve a constant loss.
One of the main points we wish to argue in this manuscript is that one can "bypass" impossibility results by utilizing the simplicity of the weak hypotheses. We demonstrate this by presenting a boosting paradigm (Algorithm 1) called "Graph-Separation Boosting" which circumvents the lower bound from [32].
Parameters: a base-class B, a weak learner W with sample complexity m₀, an advantage parameter γ > 0. Input: a sample S = ((x₁, y₁), . . . , (xₙ, yₙ)) which is γ-realizable by B, and a black-box oracle access to the weak learner W. While virtually all boosting algorithms (e.g., AdaBoost and Boost-by-Majority) employ majority vote rules as aggregation functions, our boosting algorithm allows for more complex aggregation functions. This enables the quadratic improvement in the oracle complexity.
We now describe and analyze our edge separability-based boosting algorithm. Throughout the rest of this section, fix a base-class B ⊆ {±1}^X, an edge parameter γ > 0, and a weak learner denoted by W. We let m₀ denote the sample complexity of W and assume that for every distribution D which is γ-realizable with respect to B: E_{S′∼D^{m₀}}[corr_D(W(S′))] ≥ γ/2, where corr_D(h) = E_{(x,y)∼D}[h(x) · y] is the correlation of h with respect to D.
The main idea behind the algorithm is simple. We wish to collect as fast as possible a sequence of base classifiers h₁, . . . , h_T ∈ B that can be aggregated to produce a consistent hypothesis, i.e., a hypothesis h ∈ {±1}^X satisfying h(xᵢ) = yᵢ for all i ∈ [n]. The next definition and lemma provide a sufficient and necessary condition for reaching such a hypothesis.

THEOREM 4.3 (Oracle Complexity Upper Bound (Theorem 2.3 restated)). Let S be an input sample of size n which is γ-realizable with respect to B, and let T denote the number of rounds Algorithm 1 performs until all opposite-labelled pairs in S are separated. Then, E[T] = O(log(n)/γ).
PROOF. Let corr_{D_t}(h) := E_{(x,y)∼D_t}[y · h(x)]. Therefore, by the definition of D_t: (4) where in the first transition we used that E[T] = Σ_{t=1}^∞ Pr[T ≥ t] for every random variable T ∈ ℕ. ■
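For intuition, here is the expected-rounds calculation in our notation (a sketch, assuming each round separates in expectation a γ/2-fraction of the remaining opposite-labelled pairs): writing E_t for the set of unseparated pairs after t rounds, |E₀| ≤ n², so

E[|E_t|] ≤ (1 − γ/2)^t · |E₀| ≤ n² · e^{−γt/2}, and Pr[T > t] = Pr[|E_t| ≥ 1] ≤ min(1, n² · e^{−γt/2});

summing, E[T] = Σ_{t≥0} Pr[T > t] ≤ (2/γ) ln(n²) + Σ_{s≥0} e^{−γs/2} = O(log(n)/γ).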

Aggregation-Dependent Generalization Bound
As discussed in Section 2.1, Algorithm 1 is a meta-algorithm in the sense that it does not specify how to find the aggregation rule in Line 10. In particular, this part of the algorithm may be implemented in different ways, depending on the choice of the base-class B. We therefore provide here a generalization bound whose quality adapts to the complexity of this stage. That is, the guarantee given by the bound improves with the "simplicity" of the aggregation rule.
More formally, we follow the notation in Section 2.1, where m₀ is the sample complexity of the weak learner W. For Item 2, we use the hybrid-compression generalization bound from [32]. For example, when B is the class of one-dimensional thresholds, the relevant aggregations have a bounded number of sign-changes as well. So, H in this case is the class of all sign functions that change sign at most O(1/γ) times, whose VC dimension is O(1/γ). Note that in this example the bound on VC(H) does not depend on T, which is different (and better) than the bound when H is defined with respect to aggregation by weighted majority. More generally, the following proposition provides a bound on VC(H) when it is known that the aggregation rule belongs to a restricted class G. This is summarized in the following corollary.
As discussed earlier, we consider the above bound of purely theoretical interest, as it assumes that the aggregation rule is completely arbitrary. We expect that for specific and structured base-classes B which arise in realistic scenarios, one could find consistent aggregation rules more systematically, and as a result also get better guarantees on the capacity of the possible aggregation rules.

Oracle Complexity Lower Bound
We next prove a lower bound on the oracle complexity showing that if one restricts to boosting algorithms which aggregate by weighted majorities, then a near quadratic dependence on 1/γ is necessary to get generalization, even if the base-class B is assumed to be a VC class.
In fact, the theorem shows that even if one only wishes to achieve a constant error ε = 1/4 with constant confidence δ = 1/4, then still nearly 1/γ² calls to the weak learner are necessary, where γ is the advantage parameter: the boosting algorithm must call the weak learner Ω̃(1/γ²) times in order to output a hypothesis h such that with probability at least 1 − δ = 3/4 it satisfies corr_D(h) ≥ 1 − ε = 3/4. The Ω̃ above conceals multiplicative factors which depend on d and logarithmic factors which depend on 1/γ.

PROOF.
Let us strengthen the weak learner W by assuming that whenever it is given a sample from a γ-realizable distribution D, it always outputs an h ∈ HS_d such that corr_D(h) ≥ γ (i.e., it outputs such an h with probability 1). Clearly, this does not affect generality in the context of proving oracle complexity lower bounds (indeed, if the weak learner sometimes fails to return a γ-correlated hypothesis then the number of oracle calls may only increase).

The VC Dimension of Composition
We conclude this part by demonstrating how the argument used in the above lower bound can extend a classical result by [5], who considered the case when G = {g} consists of a single function.
Note that this assumption does not compromise generality because: (i) a learning algorithm for B implies a learning algorithm for {±h : h ∈ B}, and (ii) VC({±h : h ∈ B}) ≤ VC(B) + 1. So, if B is not symmetric, we can replace it by {±h : h ∈ B}.
Organization. We begin with stating and proving a basic geometric characterization of γ-realizability in Section 5.1, which may also be interesting in its own right. This characterization is then used to prove Proposition 2.8, which implies that virtually all VC classes which are typically considered in the literature are expressive when used as base-classes. Then, in Section 5.2 we provide general bounds on the growth rate of the γ-VC dimension. We conclude the section by analyzing the classes of Decision Stumps (Section 5.3) and of Halfspaces (Section 5.4).

A Geometric Perspective of γ-realizability
The following simple lemma provides a geometric interpretation of γ-realizability and the γ-VC dimension, which will later be useful: a sample S = ((x₁, y₁), . . . , (xₙ, yₙ)) is γ-realizable with respect to B if and only if there exists a distribution P over B such that yᵢ · E_{h∼P}[h(xᵢ)] ≥ γ for every i ≤ n. Note that this lemma can also be interpreted in terms of norms. Indeed, since B is symmetric, the set K = conv{(h(x₁), . . . , h(xₙ)) : h ∈ B} is convex and centrally symmetric, and γ-realizability of S asserts that the scaled label vector γ · (y₁, . . . , yₙ) is dominated, coordinatewise in the directions of the labels, by a point of K.
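A plausible rendering of the duality behind this lemma (von Neumann's minimax theorem applied to the zero-sum game with payoff y · h(x)):

min_{μ∈Δ(S)} max_{h∈B} E_{(x,y)∼μ}[y · h(x)] = max_{P∈Δ(B)} min_{i≤n} yᵢ · E_{h∼P}[h(xᵢ)].

Hence S is γ-realizable (the left-hand side is ≥ γ) exactly when some mixture P of base hypotheses is γ-correlated with every labeled example simultaneously (the right-hand side is ≥ γ).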

A Condition for Universal Expressivity
The following proposition asserts that under mild assumptions on B, every sample is γ-realizable for a sufficiently small γ = γ(S) > 0. This implies that in the limit as γ → 0, it is possible to approximate any concept¹³ using weak hypotheses from B. Observe that every class B that contains singletons or one-dimensional thresholds satisfies these assumptions; the proof proceeds by showing that the span of {(h(x₁), . . . , h(xₙ)) ∈ ℝⁿ : h ∈ B} is n-dimensional, as required.
¹³ More precisely, it is possible to interpolate arbitrarily large finite restrictions of any concept. We note in passing that a result due to [3] provides an infinite version of the same phenomenon: under mild assumptions on the base-class B, they show that a variant of AdaBoost is universally consistent.

General Bounds on the -VC Dimension
In the remainder of this section we provide bounds on the γ-VC dimension for general classes as well as for specific well-studied classes. As we focus on the dependence on γ, we consider the VC dimension d to be constant. In particular, we will sometimes use asymptotic notations O, Ω which conceal multiplicative factors that depend on d.

THEOREM 5.3 (Theorem 2.14 restatement).
Let B be a class with VC dimension d. Then, for every 0 < γ ≤ 1: VC_γ(B) = Õ(d/γ²). Moreover, this bound is nearly tight as long as d is not very small compared to log(1/γ): for every γ > 0 and d ∈ ℕ there is a class B of VC dimension O(d log(1/γ)) with VC_γ(B) = Ω̃(d/γ²). Thus, the fastest possible growth of the γ-VC dimension is asymptotically ≈ d/γ². We stress however that the above lower bound is realized by a class B whose VC dimension is at least Ω(log(1/γ)), which deviates from our focus on the setting where the VC dimension is a constant and γ → 0. Thus, we prove the next theorem which provides a sharp, subquadratic, dependence on γ (but a looser dependence on d).

THEOREM 5.4 (γ-VC dimension: improved bound for small γ (Theorem 2.15 restatement)).
Let B be a class with VC dimension d ≥ 1. Then, for every 0 < γ ≤ 1: VC_γ(B) = O((1/γ)^{2 − 2/(d+1)}), where O(·) conceals a multiplicative constant that depends only on d. Moreover, the above inequality applies for any class B whose primal shatter function¹⁴ is at most d.
As follows from Theorem 2.16, the dependence on γ in the above bound is tight.

Proof of Theorem 2.14
To prove the upper bound, let B have VC dimension d, let γ > 0, and let A ⊆ X be a set of size VC_γ(B) such that every labeling of it is γ-realizable by B. Fix f : A → {±1}. By Lemma 5.1 there is a probability distribution P on B so that E_{h∼P}[h(x)] · f(x) ≥ γ for all x ∈ A. This implies, using Chernoff and union bounds, that f is a majority of k = O(log|A|/γ²) restrictions of hypotheses in B to A. As this holds for any fixed f, it follows that each of the 2^{|A|} distinct ±1 patterns on A is the majority of a set of at most k = O(log|A|/γ²) restrictions of hypotheses in B to A. By the Sauer-Shelah-Perles Lemma [30] there are less than O(|A|/d)^d such restrictions, and hence 2^{|A|} is at most the number of k-tuples of such restrictions, which is O(|A|/d)^{dk}; taking logarithms yields |A| = Õ(d/γ²), completing the proof of the upper bound.
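The concentration step can be spelled out as follows (our rendering): draw h₁, . . . , h_k i.i.d. from P and fix x ∈ A; each h_i(x) · f(x) is a ±1-valued variable with mean ≥ γ, so by Hoeffding's inequality

Pr[maj(h₁(x), . . . , h_k(x)) ≠ f(x)] ≤ Pr[(1/k) Σ_{i≤k} h_i(x) · f(x) ≤ 0] ≤ e^{−kγ²/2}.

A union bound over the |A| points shows that k = ⌈2 ln(2|A|)/γ²⌉ suffices for some realization h₁, . . . , h_k whose pointwise majority agrees with f on all of A.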
To prove the lower bound we need the following simple lemma.
Thus, there is an i so that the inner product ⟨uᵢ, v⟩ is at least the required threshold, as needed. Therefore, we conclude the claimed lower bound on VC_γ(B). This completes the proof of Theorem 2.14.

Proof of Theorem 2.15: An Improved Bound using Discrepancy Theory
There is an intimate relationship between the γ-VC dimension and Discrepancy Theory (see, e.g., the book [25]). As a first application of this relationship, we prove Theorem 2.15 by a simple reduction to a classical result in Discrepancy Theory. We begin by introducing some notation. Let 𝒮 be a family of sets over a domain A and let n denote the size of A. Discrepancy theory studies how balanced a coloring of A can be with respect to 𝒮. That is, for a coloring χ : A → {±1} and a set s ∈ 𝒮, define the discrepancy of χ with respect to s by disc(χ; s) = |Σ_{x∈s} χ(x)|. Define the discrepancy of χ with respect to 𝒮 by disc(χ; 𝒮) = max_{s∈𝒮} disc(χ; s). Finally, the discrepancy of 𝒮 is defined as the discrepancy of the "best" possible coloring: disc(𝒮) = min_χ disc(χ; 𝒮).
Low Discrepancy implies large γ-VC Dimension. A classical result due to [27, 26] asserts that every family of subsets over A with a small VC dimension admits a relatively balanced coloring: disc(𝒮) ≤ C · n^{1/2 − 1/(2d)}, (10) where d is the VC dimension of 𝒮 and C is a constant depending only on d (see also the textbook presentation in [25]). Given the class B and a set A, consider the family 𝒮 = {supp(h) : h ∈ B}, where supp(h) = {x ∈ A : h(x) = 1}. Note that since B is symmetric, it follows that supp(h), supp(−h) ∈ 𝒮 for every h ∈ B, and also note that VC(𝒮) = VC(B) = d. Let u denote the uniform distribution over A. For every h ∈ B: |E_{x∼u}[χ(x) · h(x)]| ≤ 2 · disc(χ; 𝒮)/|A| (by Equation (10) applied to the family 𝒮). In particular, as by assumption the sample ((x, χ(x)))_{x∈A} is γ-realizable, it follows that γ ≤ 2 · disc(𝒮)/|A| ≤ 2C · |A|^{−(1/2 + 1/(2d))}, and therefore |A| = O((1/γ)^{2 − 2/(d+1)}), as required. □

Decision Stumps
We next consider the class of Decision Stumps. A d-dimensional decision stump is a concept of the form sign(s(x_i − t)), where i ≤ d, s ∈ {±1} and t ∈ ℝ. In other words, a decision stump is a halfspace which is aligned with one of the principal axes. This class is popular in the context of boosting, partially because it is easy to learn, even in the agnostic setting. Also note that the Viola-Jones framework hinges on a variant of decision stumps [35].
Moreover, the dependence on γ is tight, already in the 1-dimensional case. In fact, for every γ such that 1/γ ∈ ℕ, the class DS₁ γ-interpolates every set A ⊆ ℝ of size 1/γ. For d > 1, the class of d-dimensional decision-stumps γ-interpolates every set A ⊆ ℝ^d of size 1/γ, provided that there exists i ≤ d so that every pair of distinct points x, x′ ∈ A satisfies x_i ≠ x′_i.
The proof of Theorem 2.17 follows from a more general result concerning unions of classes with VC dimension equal to 1. We note that the bounds are rather loose in terms of d: the upper bound yields O(d/γ) while the lower bound gives only Ω(1/γ). Also note that since the VC dimension of decision stumps is Θ(log d) (see [16] for a tight bound), Theorem 2.14 implies an upper bound of Õ(log d/γ²). It would be interesting to tighten these bounds.

Proof of Theorem 2.17
Lower Bound on ID_γ(DS₁). We need to show that for every γ such that 1/γ ∈ ℕ, every set A ⊆ ℝ of size 1/γ is γ-interpolated. The case of d > 1 follows by a simple reduction to the d = 1 case: let A ⊆ ℝ^d be of size 1/γ such that there exists i ≤ d for which every pair of distinct points x, x′ ∈ A satisfies x_i ≠ x′_i. Then, by projecting A on the i'th coordinate we obtain a 1-dimensional set which is γ-interpolated by DS₁ by the above argument. Equivalently, A is γ-interpolated by decision-stumps that are aligned with the i'th axis.
Upper Bound. The proof idea is to derive labels (y₁ . . . yₙ) ∈ {±1}ⁿ and a distribution μ over A such that (i) every g ∈ G satisfies E_{(x,y)∼μ}[y · g(x)] = 0, and (ii) μ is sufficiently close to the uniform distribution over A (in ℓ₁ distance). Then, since μ is sufficiently close to uniform and since the g's are η-covers for η = O(d/n) with respect to the uniform distribution, it will follow that E_{(x,y)∼μ}[y · s(x)] ≤ O(d/n) for all s ∈ DS_d, which will show that n = O(d/γ), as required.
To construct μ and ȳ we consider the polytope defined by the following Linear Program (LP) on variables μ₁, . . . , μₙ, whose constraints encode Equation (2). Consider a vertex μ = (μ₁, . . . , μₙ) of this polytope. Since the number of equality constraints is at most n/2, there must be at least n/2 inequality constraints that μ meets with equality.

Halfspaces
For halfspaces in ℝ^d, we give a tight bound (in terms of γ) of Θ((1/γ)^{2 − 2/(d+1)}) on the γ-VC dimension. The proof of Theorem 2.16 is based on ideas from Discrepancy theory. In particular, it relies on the analysis of the discrepancy of halfspaces due to [1] (see [25] for a textbook presentation of this analysis).

Tools and Notation from Discrepancy Theory
Weighted Discrepancy. Let μ be a (discrete) distribution over X and let χ : X → {±1} be a labeling of X which we think of as a coloring. For a hypothesis h : X → {±1}, define the μ-weighted discrepancy of χ with respect to h by disc_μ(χ; h) = Σ_{x : h(x)=1} μ(x) · χ(x).
The following simple identity relates the weighted discrepancy with γ-realizability. For every distribution μ, target concept χ : X → {±1} and hypothesis h : X → {±1}: E_{x∼μ}[χ(x) · h(x)] = disc_μ(χ; h) − disc_μ(χ; −h). (11) (Indeed, splitting the expectation according to the sign of h(x) gives Σ_{x:h(x)=1} μ(x)χ(x) − Σ_{x:h(x)=−1} μ(x)χ(x), which is exactly the right-hand side.) Motion Invariant Measures. The proof of Theorem 2.16 uses a probabilistic argument. In a nutshell, the lower bound on the γ-VC dimension follows by showing that if A is dense then each of its 2^{|A|} labelings is γ-realizable. Establishing γ-realizability is achieved by defining a special distribution over halfspaces such that for every distribution μ on A and every labeling χ : A → {±1}, a random halfspace drawn from it is, in expectation, γ-correlated with χ with respect to μ. The special distribution over halfspaces which has this property is derived from a motion invariant measure: this is a measure over the set of all hyperplanes in ℝ^d which is invariant under applying rigid motions (i.e., if H′ is a set of hyperplanes obtained by applying a rigid motion on a set of hyperplanes H, then the measures of H and H′ are the same). It can be shown that up to scaling, there is a unique such measure (similar to the fact that the Lebesgue measure is the only motion-invariant measure on points in ℝ^d). We refer the reader to [25, Chapter 6.4] for more details on how to construct this measure and some intuition on how it is used in this context.
One property of this measure that we will use, whose planar version is known as the Perimeter Formula, is that for any convex set K the set of hyperplanes which intersect K has measure equal to the boundary area of K. Note that this implies that whenever the boundary area of K is 1, this measure defines a probability distribution over the set of all hyperplanes intersecting K.

Proof of Theorem 2.16
The following lemma is the crux of the proof. Theorem 2.16 is implied by Lemma 5.11 as follows: let A be a dense set as in Lemma 5.11.
We distinguish between two cases: (i) if disc_μ(χ; −h) ≤ 0, then by Equation (11), E_{x∼μ}[χ(x) · h(x)] ≥ disc_μ(χ; h). The main difference is that we consider weighted discrepancy whereas the proof in [25] handles the unweighted case. We therefore describe the modifications needed to incorporate weights.
Following [25] we restrict our attention to the 2-dimensional case and to sets A which are n^{1/2} × n^{1/2} regular grids. The extension of our result to the general d-dimensional case is identical to the extension described in [25, page 191].
Following [25], denote by ν a motion-invariant measure on the set of lines which intersect S. Note that ν is indeed a probability distribution, because the perimeter of S is 1. By identifying every line with the upper¹⁵ halfplane it supports, we view ν as a distribution over halfplanes.
¹⁵ We may ignore vertical lines as their ν-measure is 0.

Conclusion and Open Problems
We conclude the paper with some suggestions for future research: Algorithm 1 suggests the possibility of improved boosting algorithms which exploit the simplicity of the base-class and use more complex ("deeper") aggregation rules. It will be interesting to explore efficient realizations of Algorithm 1 for realistic base-classes B.
The bounds provided on the γ-VC dimensions of halfspaces and decision stumps are rather loose in terms of d. It will be interesting to find tight bounds. Also, it will be interesting to explore how the γ-VC dimension behaves under natural operations. For example, for k > 0 consider the class B′ of all k-wise majority votes of hypotheses from B. How does VC_γ(B′) behave as a function of k and VC_γ(B)?
Characterize for which classes B there exist boosting algorithms which output weighted majorities of the base hypotheses using much fewer than Õ(γ⁻²) oracle calls; e.g., for which classes is it possible to use only Õ(γ⁻¹) oracle calls?