Anonymous: That’s actually an excellent question! The “alternative algorithm” in the Buescher-Kumar paper that I alluded to — the one that sometimes outperforms the obvious algorithm — gains its advantage by a sort of cross-validation. As I said, it seems to be open whether that helps for finite concept classes.

]]>For instance when inferring the mean of an unknown d-dimensional Gaussian under squared error, minimum empirical risk estimator is not admissible. In other words there are estimators that uniformly dominate the estimator that minimizes squared loss on training data.

Recently similar effect was shown for more general distributions (albeit asymptotically) http://www.ism.ac.jp/~eguchi/pdf/MLE.pdf

As far as the open problem, I wonder if the simpler claim holds — suppose hypothesis space is finite — is there a problem where non-obvious algorithm does uniformly better?

]]>The “obvious” algorithm, which seeks to fit the observations, seems to be much like a maximum-likelihood estimator. That is, it’s frequentist. The observed data are everything.

An alternative is a Bayesian algorithm. It would start with a prior over functions, P0(f). Then it would update based on observations [P0(f) –> P(f)], and report the mean of P(f) [1].

To see why this might be better, consider MLE on a coin. You want to report the coin’s bias, P_heads. You flip the coin once, get heads, and conclude that P_heads = 1 fits the data best. ‘Course, this is a fairly ballsy statement for a guy who’s only done one measurement.

Bayesian might start with a uniform prior P( P_heads ) = 1 for all P_heads in [0..1]. It would then report P_heads = 2/3 as its best guess.

Basically, the problem with MLE is that the future may look very different from the past. The prior in Bayesian methods serves as a sort of reminder (to the algorithm) not to get too cocky! “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.”

[1] Note that the alternative method requires {f} to be a convex, measurable set.

]]>(This is the downside of posting research stuff: anything I post might quickly become irrelevant. Also, no one besides me really cared to begin with.)

]]>– Don’t forget the Curse of Dimensionality (TM).

]]>Consider the class of functions H from [0,1) to [0,1], which consists of the constant function h_0(x)=1/4, as well as (for all positive integers k) the function h_k(x) that returns the kth bit in the binary expansion of x.

Given training data of the form

x_1,f(x_1)

…

x_m,f(x_m),

our goal is to pick the function in H that best approximates f in the L1 norm.

Now suppose f is the constant function f(x)=0. Then here’s the point: if x_1,…,x_m are chosen uniformly random from [0,1), then with probability 1 there exists a k>0 such that

h_k(x_i) = f(x_i) = 0

for all i=1,…,m. In other words, there exists a k>0 such that for every i, the kth bit in the binary expansion of x_i is 0.

So if we use the “obvious” learning algorithm — the one that looks for a hypothesis that agrees with all the training data — then we’ll pick such an h_k. But that’s not what we *should have* done, since h_0 is a better approximation to f than h_k for any k>0. In particular, h_1,h_2,… all have L1 distance 1/2 to f, whereas h_0 has L1 distance only 1/4 to f.

Buescher and Kumar give an alternative learning algorithm that *does* pick h_0.

One obvious problem with this counterexample is that the class of hypotheses is infinite (and indeed, has infinite VC-dimension). What happens if we truncate the class of hypotheses to h_0,…,h_n? I think what happens is that the sample complexities for both the obvious algorithm and the non-obvious one go down to ~log(n).

Is there a case where the VC-dimension is finite, but a non-obvious algorithm has *sample complexity* asymptotically smaller than the obvious algorithm’s? That seems to be an open problem. All Buescher and Kumar show is that the known upper bounds can differ by a constant factor.

So to summarize: there exists a weird concept class with infinite VC-dimension, for which the obvious algorithm fails to learn with *any* finite number of samples, but a different algorithm succeeds. On the other hand, the problem I was originally thinking of — whether the obvious algorithm can yield a non-optimal bound on sample complexity in terms of VC-dimension — is apparently still open.

no?

]]>