Deoxyribononapproximability

[Updates]

Alright, here’s a problem for all you bioinformatistas and inapproximabistas out there, which was inspired by this post of Eliezer Yudkowsky at Overcoming Bias (see also the comments there).

Let a DNA sequence be an element of {A,C,G,T}*, and suppose we’re allowed the following primitive operations: (1) insert a base pair anywhere we want, (2) delete any substring, (3) reverse any substring, and (4) copy any substring into any other part of the string. Then given a DNA sequence S, how hard is it to estimate the minimum number of operations needed to produce S starting from the empty string?
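
To pin down the model, here is a minimal sketch of the four primitive operations in Python (the string representation and index conventions are my own; the problem statement doesn't fix them):

```python
# A DNA sequence is a string over {A, C, G, T}; indices follow Python slice conventions.

def insert(s, i, base):
    """(1) Insert a single base at position i."""
    return s[:i] + base + s[i:]

def delete(s, i, j):
    """(2) Delete the substring s[i:j]."""
    return s[:i] + s[j:]

def reverse(s, i, j):
    """(3) Reverse the substring s[i:j]."""
    return s[:i] + s[i:j][::-1] + s[j:]

def copy(s, i, j, k):
    """(4) Copy the substring s[i:j] and paste it at position k."""
    return s[:k] + s[i:j] + s[k:]

# Four operations starting from the empty string:
s = ""
s = insert(s, 0, "A")      # "A"
s = insert(s, 0, "G")      # "GA"
s = copy(s, 0, 2, 2)       # "GAGA"
s = reverse(s, 1, 3)       # "GGAA"
```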

Closely related is the following problem: by starting from the empty string and applying o(n) operations, can we produce a “pseudorandom DNA sequence” of length n — that is, a sequence that can’t be distinguished in polynomial time from a uniform random one?

(Note 1: For both problems, we might also want to stipulate that every intermediate sequence should have size at most polynomial in n. Or better yet, maybe one can prove that such an assumption is without loss of generality.)

(Note 2: I’m also very interested in what happens if we disallow the powerful operation of reversal.)

For all I know, these problems might have trivial (or at any rate, known) answers; I just came up with them and haven’t thought them through.

What the problems are really getting at is this: is the “effective number of bits” in your genome (that is, the number of bits from a polynomial-time algorithm’s perspective) limited by how many ancestors you’ve had since life on Earth began? Or can it be vastly greater?

Update (11/4): Rereading the last few paragraphs of Eliezer’s post, I see that he actually argues for his central claim — that the human genome can’t contain more than 25MB of “meaningful DNA” — on different (and much stronger) grounds than I thought! My apologies for not reading more carefully.

In particular, the argument has nothing to do with the number of generations since the dawn of time, and instead deals with the maximum number of DNA bases that can be simultaneously protected, in steady state, against copying errors. According to Eliezer, copying a DNA sequence involves a ~10^-8 probability of error per base pair, which — because only O(1) errors per generation can be corrected by natural selection — yields an upper bound of ~10^8 on the number of “meaningful” base pairs in any given genome.
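
For concreteness, here is the back-of-envelope arithmetic that the 25MB figure presumably rests on (my reconstruction, not a quote from Eliezer's post):

```latex
% Assumptions: copying-error rate ~1e-8 per base pair per generation, and
% natural selection able to remove only O(1) errors per genome per generation.
\[
  \epsilon \approx 10^{-8}, \qquad
  \epsilon N \lesssim O(1) \;\Longrightarrow\; N \lesssim 10^{8} \text{ base pairs}.
\]
% At 2 bits per base pair:
\[
  10^{8}\ \text{bp} \times 2\ \tfrac{\text{bits}}{\text{bp}}
  = 2 \times 10^{8}\ \text{bits}
  = 2.5 \times 10^{7}\ \text{bytes} \approx 25\ \text{MB}.
\]
```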

However, while this argument is much better than my straw-man based on the number of generations, there’s still an interesting loophole. Even with a 10^-8 chance of copying errors, one could imagine a genome reliably encoding far more than 10^8 bits (in fact, arbitrarily many bits) by using an error-correcting code. I’m not talking about the “local” error-correction mechanisms that we know DNA has, but about something more global — by which, say, copying errors in any small set of genes could be completely compensated by other genes. The interesting question is whether natural selection could read the syndrome of such a code, and then correct it, using O(1) randomly-chosen insertions, deletions, transpositions, and reversals. I admit that this seems unlikely, and that even if it’s possible in principle, it’s probably irrelevant to real biology. For apparently there are examples where changing even a single base pair leads to horrible mutations. And on top of that, we can’t have the error-correcting code be too good, since otherwise we’ll suppress beneficial mutations!
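
To make the “read the syndrome” idea concrete, here is a toy sketch (purely illustrative, with helper names of my own; no claim that anything like it operates in real biology): map bases to bits, impose a sparse set of parity checks in which each check touches only a few positions, and note that a single bit error flips exactly the checks containing the corrupted position. That flip pattern is the information a global code would hand to whatever mechanism does the correcting; whether O(1) random insertions, deletions, transpositions, and reversals could act on it is exactly the question above.

```python
import random

# Toy illustration only: the "syndrome" of a sparse parity-check code laid over a genome.
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def to_bits(dna):
    return [b for base in dna for b in BASE_BITS[base]]

def random_sparse_checks(n_bits, n_checks, weight=4, seed=0):
    """Each parity check touches only `weight` randomly chosen bit positions."""
    rng = random.Random(seed)
    return [rng.sample(range(n_bits), weight) for _ in range(n_checks)]

def syndrome(bits, checks):
    return [sum(bits[i] for i in positions) % 2 for positions in checks]

genome = "GATTACAGATTACAGATTACA"
bits = to_bits(genome)
checks = random_sparse_checks(len(bits), n_checks=10)

reference = syndrome(bits, checks)
corrupted = list(bits)
corrupted[5] ^= 1                      # one copying error
flipped = [i for i, (x, y) in enumerate(zip(reference, syndrome(corrupted, checks))) if x != y]
print(flipped)                         # exactly the checks that contain bit position 5
```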

Incidentally, Eliezer’s argument makes the falsifiable prediction that we shouldn’t find any organism, anywhere in nature, with more than 25MB of functional DNA. Does anyone know of a candidate counterexample? (I know there are organisms with far more than humans’ 3 billion base pairs, but I have no idea how many of the base pairs are functional.)

Lastly, in spite of everything above, I’d still like a solution to my “pseudorandom DNA sequence” problem. For if the answer were negative — if given any DNA sequence, one could efficiently reconstruct a nearly-optimal sequence of insertions, transpositions, etc. producing it — then even my original straw-man misconstrual of Eliezer’s argument could put up a decent fight!

Update (11/5): Piotr Indyk pointed me to a paper by Ergün, Muthukrishnan, and Sahinalp from FSTTCS’2003, which basically solves my problem in the special case of no reversals. It turns out that you can estimate the number of insert, delete, and copy operations needed to produce a given DNA sequence to within a factor of 4, by just applying Lempel-Ziv compression to the sequence. Thanks, Piotr!
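
Here is roughly what that looks like (a simplified greedy parse of my own, not the exact procedure from the paper; this version only matches against the already-parsed prefix, which may change the constants):

```python
import random

def lz_phrase_count(s):
    """Greedy LZ-style parse: each phrase is the longest prefix of the unparsed
    suffix that already occurs in the parsed prefix, plus one fresh character.
    Per Ergün et al., an LZ phrase count like this is within a constant factor
    of the number of insert/delete/copy operations needed to build s from the
    empty string."""
    i, phrases = 0, 0
    while i < len(s):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in s[:i]:
            length += 1
        i += length + 1          # the longest reusable block, plus one literal
        phrases += 1
    return phrases

print(lz_phrase_count("ACGT" * 200))                                          # tiny: highly compressible
print(lz_phrase_count("".join(random.choice("ACGT") for _ in range(800))))    # ~n/log n: incompressible
```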

Another Update (11/5): Andy Drucker has pointed out that, in the case where reversals are allowed, we can approximate the number of insert, delete, copy, and reverse operations needed to produce a given DNA sequence to within a factor of 16, by combining the Lempel-Ziv approach of Ergün et al. with a clever trick: maintain both the sequence and its reversal at all times! Interestingly, though, this trick doesn’t seem to work for transforming one sequence into another (a more general problem than I asked about, and the one considered by Ergün et al.).

50 Responses to “Deoxyribononapproximability”

  1. harrison Says:

    Without reversals, it seems to me like you’ll need \Omega(n/log n) deletions to obtain a pseudorandom sequence of length n.

    (Handwave-y, Swiss-cheese-y argument: If you repeatedly copy the entire sequence and then paste it wherever, it’ll take you ~log n operations. If you can insert O(1) base pairs in between copy-paste operations, then it’s still O(log n). But then after you fix a subsequence of length O(log n), the rest of the sequence is mostly determined! So the best you can do is get about n/log n substrings of length about log n.)

    Of course, there’s so many holes in the above argument that I won’t even bother to point them all out. The point’s that the number of cuts necessary seemingly grows too fast to get much faster than o(n). (And if you try to slow it down — say, by adding sqrt(n) bases in between doubling operations instead of O(1) — I’d wager that the resulting sequence is too redundant to be pseudorandom.)

    That said, I’d love to be proved wrong…and chances are doing that won’t even be that hard.

  2. Job Says:

    If i understand correctly (which is not a given) i think that if we allow for any ACGT string, then an algorithm can’t do any better than O(n).

    A rough argument: suppose that for every ACGT string s of length n there is a sequence of f(n) operations (where f(n) is always less than n) that produces s.

    Then we can encode any s of length n into less than n characters (the encoding would be the sequence of operations encoded). By applying this process in succession we would be able to compress any s into a handful of chars, which isn’t possible (the pigeon hole thing?).

    So in the worst case you’ll always need a sequence of n or more operations to build s (if we allow for every s). So an algorithm wanting to identify the minimal sequence of operations needed to build s in the worst case will come up against a sequence of n or more operations. Just outputting these will require O(n) operations.

  3. GreedyAlgorithm Says:

    Job: The assumption is false, as there is no sequence of 0 operations that will produce a string of length 1.

  4. Job Says:

    I agree, but it wasn’t an assumption, it was a supposition for the sake of a contradiction.

  5. justplainuncool Says:

    Job – the problem with your “proof” is that the encoding takes more than O(1) characters for each operation. For the reversal operation, you need to write down the character for reversal and both the beginning and end numbers of the string – which takes O(log n) characters. Same thing for copying – which actually is worse, since 3 numbers are needed there.

  6. Job Says:

    justplainuncool, yeah i saw that but i purposely overlooked it :p . It would be 2 or 3 numbers extra for each operation, each of log(log(n)) bits. There is still a statement there somewhere. I think adding these extra parameters in, in the worst case it would cause us to say that, for every string s where n > some constant, … etc, which is just as bad i think.

  7. Piotr Says:

    The paper:

    Funda Ergün, S. Muthukrishnan, Süleyman Cenk Sahinalp: Comparing Sequences with Segment Rearrangements. FSTTCS 2003: 183-194

    presents a constant factor approximation for the problem of approximating block edit distance between two strings. I am not sure if the allowed operations are the same as in your model though.

  8. Jay Says:

    Job: maybe I’m just being obtuse, but I fail to see how either the weak or strong form of the pigeonhole principle relates to any limit of the sort you are describing (weak: if you have N+1 objects and N holes, at least one hole has at least two objects; strong: if you have N holes, and an associated natural number for each hole, and the number of objects you have is the sum of those associated numbers minus N plus one, then at least one hole has at least as many objects as the associated number). I realize the reason for your question mark was probably that you were convinced that some principle meant the compression couldn’t get lower than some particular limit, but not sure which one; however, if that was the principle you were thinking of, I’d be very interested in finding out what I’m not thinking of 🙂

  9. Scott Says:

    Piotr: Thanks! I’ll take a look.

  10. Job Says:

    Jay, i was going for: If we have n bits then we can encode 2^n states. If every string of n bits can be compressed into less than n bits then the 2^n strings can be encoded into 2^(n-1) states or less – meaning that 2 or more strings would need to be encoded into a single one of the 2^(n-1) states (weak PHP i guess) which isn’t possible.

    With the piece about being able to encode any string into a handful of chars i was just going further to say that if any string of n bits can be encoded into n-1 bits, then we can apply that successively and encode any string of n bits into a single bit.

  11. Job Says:

    By the way, i think from now on i’ll put “undergrad speaking” in my comments because of the audience that usually comments here.
    I want the immunity for saying something stupid. 😛

  12. Scott Says:

    Job, so long as all the intermediate strings have length polynomial in n, it’s obvious that you need Ω(n/log n) operations to produce an arbitrary string of length n. But that wasn’t the question.

    The question is whether, using much fewer than n operations, you can produce a string that can’t be efficiently distinguished from a random string. (Or at least from some string requiring ~n operations.)

    But hey, as long as you’ve brought up Shannon lower bounds, there are some interesting questions there too:

    (1) Can the obvious Ω(n/log n) lower bound for the number of operations to produce a truly random string be improved to Ω(n)?

    (2) What if the intermediate strings can be exponentially long? In this case, the only lower bound I see immediately is Ω(√n) — since either you have to spend √n operations repeatedly doubling the string until it has length 2^√n, or else each operation takes fewer than √n bits to specify and we get a Shannon lower bound of n/√n = √n. Can this lower bound be improved, maybe even to Ω(n)?

  13. Scott Says:

    OK, I just read the paper by Ergün et al. (which is available here). It’s indeed extremely relevant.

    Ergün et al. don’t allow reversals, but other than that their model is exactly the same. Their main result is a linear-time (!) algorithm that approximates, to within a factor of 4, the number of operations needed to get from one DNA sequence to another. In particular, one can 4-approximate the number of insert, delete, and copy operations needed to produce a given DNA sequence S from the empty sequence, by just applying Lempel-Ziv compression to S.

    They also point out that getting an exact answer is NP-hard (at least for the problem of transforming one sequence to another — I don’t know about transforming the empty sequence to a given one).

    Interestingly, if we only allow insert and copy operations — no deletions — then Ergün et al. say the best known approximation factor is O(log n), and they cite a SODA result of Lehman and Shelat showing that “any significant improvement to this approximation would require progress on a longstanding algebraic problem.”

  14. Scott Says:

    Duhhhhh….

    I just realized that one can produce any string of length n using O(n/log n) insert, delete, and copy operations (reversals aren’t even needed). This falsifies Job’s claim of an Ω(n) lower bound.

    Here’s the algorithm: let m = n/log n. Then first produce a string A_m consisting of the concatenation of all m binary strings of length log(m). As an example,

    A_8 = 000 001 010 011 100 101 110 111.

    To produce A_m for m>1, first produce A_{m/2}, then copy it, and append a 0 to each block of one copy of A_{m/2} and a 1 to each block of the other copy. Let T(m) be the number of insert and copy operations needed to produce A_m. Then we get the recurrence relation T(m) = m + T(m/2), which we solve to obtain T(m) = O(m) = O(n/log n).

    Then given any n-bit string S, we can get S from A_m with only n/log m = O(n/log n) operations, by simply copying log(m)-bit chunks. The final step is to erase A_m.
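
    A sketch of this construction in Python (my own code and operation bookkeeping, written just to check the count; the exact choice of block length is mine):

```python
import math, random

class Rewriter:
    """Working string plus a counter of insert/delete/copy operations
    (reversals aren't needed for this construction)."""
    def __init__(self):
        self.s, self.ops = "", 0
    def insert(self, i, ch):
        self.s = self.s[:i] + ch + self.s[i:]; self.ops += 1
    def delete(self, i, j):
        self.s = self.s[:i] + self.s[j:]; self.ops += 1
    def copy(self, i, j, k):
        self.s = self.s[:k] + self.s[i:j] + self.s[k:]; self.ops += 1

def produce(target):
    """Build the table of all length-L binary blocks by doubling, then assemble
    the target bit string by copying L-bit chunks out of the table."""
    n = len(target)
    L = max(1, int(math.log2(n / math.log2(n)))) if n >= 4 else 1
    R = Rewriter()
    R.insert(0, "0"); R.insert(1, "1")           # the table of all length-1 blocks
    blocklen = 1
    while blocklen < L:
        half = len(R.s)
        nblocks = half // blocklen
        R.copy(0, half, half)                    # duplicate the whole table
        for b in range(nblocks - 1, -1, -1):     # append a "1" to each block of the second copy
            R.insert(half + (b + 1) * blocklen, "1")
        for b in range(nblocks - 1, -1, -1):     # append a "0" to each block of the first copy
            R.insert((b + 1) * blocklen, "0")
        blocklen += 1
    table_len = len(R.s)
    for i in range(0, n, L):                     # copy the target into place, chunk by chunk
        chunk = target[i:i + L]
        src = R.s.index(chunk)                   # every L-bit pattern occurs in the table
        R.copy(src, src + len(chunk), len(R.s))
    R.delete(0, table_len)                       # erase the table
    assert R.s == target
    return R.ops

random.seed(1)
S = "".join(random.choice("01") for _ in range(4096))
print(produce(S), "operations for n =", len(S))  # well under n (roughly a thousand here)
```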

    It’s an interesting question whether n/log n is tight if one allows exponentially-long intermediate strings (plus possibly reversals).

  15. michael vassar Says:

    “Incidentally, Eliezer’s argument makes the falsifiable prediction that we shouldn’t find any organism, anywhere in nature, with more than 25MB of functional DNA. ”
    No it doesn’t. Some species have more than 4 offspring per mated pair on average, allowing more bits of information per generation. Other species have lower mutation rates per generation. Some are far from evolutionary equilibria and result from recent plant hybridization events where multiple whole genomes are combined into a single new species.
    In practice though, we are far from being able to confidently assess how much of the DNA in any given species is functional except in the case of very simple organisms.

  16. Scott Says:

    Michael: OK, you’re right. I should have said: we’ll never find any organism in evolutionary equilibrium, with mutation rate ε and K offspring per mated pair, with more than (log2(K)-1)/(8388608ε) MB of functional DNA.

  17. Job Says:

    Scott, can you tell where i went wrong in my reasoning:
    There are three operation types:
    – insert
    – copy
    – delete

    So we need two bits to encode the operation.
    Insert receives a bit to insert and a location in which to insert. Copy receives two locations for specifying the region that should be copied and another location for specifying where to copy it to.
    Delete receives two locations, for specifying the region that should be deleted.

    Each location is log(log(n)) bits so we have:
    insert 1 + log(log(n)) bits
    delete 2*log(log(n)) bits
    copy 3*log(log(n)) bits

    Your algorithm would generate a unique sequence of operations that produces a given string S, in n/logn time. So there are at most n/logn operations.

    That leaves the total amount of bits needed to represent the operations to build S at:
    2*3*log(logn)*n/logn
    (If we assume all operations are of type copy, worst case)

    Since logn grows faster than 6*log(logn), then 6*log(log(n))*n/logn is always less than n for n > c, for some c.

    That would mean that we can encode any n bits (where n > c) into fewer bits, which isn’t possible.

    I think a tight lower bound would probably be n/log(logn).

  18. Scott Says:

    Each location is log(log(n)) bits

    There’s your mistake right there. Each location is log(n) bits. 🙂

  19. Job Says:

    Ah, i don’t know why i was thinking log(log(n)).

  20. Patrick Cruce Says:

    “And on top of that, we can’t have the error-correcting code be too good, since otherwise we’ll suppress beneficial mutations!”

    It’s actually a common misconception that biological systems should have mechanisms that allow a certain number of mutations for the purpose of accruing beneficial adaptations. From an evolutionary perspective, all genes should favor the highest possible fidelity copies. Any genes that have any lower copying fidelity will necessarily have fewer copies in future generations and thus lose the evolutionary game to a gene with higher copying fidelity.

  21. Andy D Says:

    So how powerful is the reversal operation, really?

    Suppose we can produce string x in k applications of operations (1)-(4). I claim we can do it in 4k + 1 operations, without reversals. (Of course, this will show that the Lempel-Ziv approach is also a constant-factor approximation in our setting.)

    First, note that, if you can produce x in k steps with (1)-(4), you can also produce x^R, i.e. the reversal of x, in k steps with (1)-(4), by ‘flipping’ each operation.

    Second, note that you can produce x.x^R, their concatenation, in 2k steps by running both productions in parallel, interleaving steps on the two halves. If the string after the i’th step of the original production of x was s, then the string after the 2i’th step of the new production will be s.s^R.

    Now, in this ‘joint’ production we eliminate reversals as follows: say in the production of x, at the i’th stage we wish to move from

    s = a.b.c to s’ = a.b^R.c …

    well, in the joint production, at stage 2i we’ve got the string

    a.b.c.c^R.b^R.a^R

    By two copy and two deletion operations, we reach

    a.b^R.c.c^R.b.a^R,

    as needed to advance both sides’ progress.

    Finally, when the joint production’s finished, delete the right-hand side to yield x. That’s 4k + 1 steps.
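
    A concrete check of this simulation step, with strings and index arithmetic of my own choosing (not Andy's notation), showing that two copies and two deletes carry a.b.c.c^R.b^R.a^R to a.b^R.c.c^R.b.a^R:

```python
def rev(s):
    return s[::-1]

a, b, c = "ACG", "TTAG", "CA"
before = a + b + c + rev(c) + rev(b) + rev(a)      # w.rev(w) with w = a.b.c
target = a + rev(b) + c + rev(c) + b + rev(a)      # w'.rev(w') with w' = a.rev(b).c

s = before
i = len(a + b + c + rev(c))                        # rev(b) already sits here, in the right half
s = s[:len(a)] + s[i:i + len(b)] + s[len(a):]      # copy 1: paste rev(b) in front of b

j = len(a) + len(b)                                # the original b, now shifted right by len(b)
k = len(s) - len(a)                                # the position just before rev(a)
s = s[:k] + s[j:j + len(b)] + s[k:]                # copy 2: paste b in front of rev(a)

s = s[:j] + s[j + len(b):]                         # delete 1: remove the original b
i2 = len(a + b + c + rev(c))                       # the original rev(b), back at its old offset
s = s[:i2] + s[i2 + len(b):]                       # delete 2: remove the original rev(b)

assert s == target
print(s)                                           # ACGGATTCAACTTAGGCA
```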

  22. Nick Tarleton Says:

    From an evolutionary perspective, all genes should favor the highest possible fidelity copies.

    Hmm… Suppose there are two separated populations, identical except that one has a gene that makes the mutation rate negligibly low. Naturally the mutating population will acquire greater variation over time. If the environment shifts, the homogeneous population may be wiped out but part of the diverse population may survive. So in this case, lower-fidelity copying is more fit in the long run. This is highly contrived, of course.

  23. Job Says:

    Nick, that’s true but the diverse population is more likely to acquire mutations that are either neutral or harmful (rather than beneficial) even in the new environment. And meaningful environment shifts probably don’t happen enough to benefit a diverse population.

    Also a population that’s more subject to mutations is more likely to obtain beneficial mutations than another population, but it’s also even more likely to lose any beneficial mutations it has acquired over time, considering that most mutations are neutral or harmful.

  24. Carlos Says:

    @Job#23, regarding beneficial vs harmful mutations:

    Given a sufficiently small mutation, an organism is as likely to benefit from it as it is to be harmed by it. Think of moving a point (x_0, …, x_n) an infinitesimal amount along some direction vector, and looking at the smooth (yes, that’s the big assumption) fitness function of this point. To first order, the change in fitness will be determined by the dot product between the gradient of the fitness function at the point and the direction vector. Half of the vectors produce beneficial changes, half of them harmful. This is a point that Dawkins makes over and over again in his books.

  25. such.ire Says:

    “From an evolutionary perspective, all genes should favor the highest possible fidelity copies.”

    This is not true. Look at bacteriophages; they all have very high error rates, so that they can adapt quickly, but not so high that the mutational fluctuations prevent them from stably occupying fitness maxima. As an example, Phi-X174 is a virus with no mismatch repair abilities, because its wildtype has no GATC sequences (which are needed for the host mismatch repair machinery). The complete lack of GATC sequences in the virus implies a strong selection against the existence of mismatch repair.

    And HIV essentially prevails against humans because of its fast evolutionary time scale, due to its extremely high mutation rate. It rides the error catastrophe threshold.

  26. Craig Helfgott Says:

    I think that what Eliezer is really saying is that we may need a better metric for the amount of information contained in DNA (as opposed to the naive 2*# of bases measure). If there’s a SNP (single-nucleotide polymorphism) that doesn’t significantly impact survival, does it really represent “meaningful DNA”?

    (I would argue that it does; SNPs can cause visible changes in phenotype that are nevertheless not significantly selected for/against. Eliezer seems to be claiming otherwise. It would be interesting to ask him whether he thinks the genes coding for eye color are “meaningful”).

  27. Craig Helfgott Says:

    Okay, I actually read the article now; that’s not what he’s saying.

  28. Markk Says:

    “because only O(1) errors per generation can be corrected by natural selection”.

    I’m having a hard time understanding this. In sexual reproduction for example, you could have any number of errors corrected in one generation. For example, ALL the errors on the Y chromosome will be “corrected” if the offspring is a female. Similarly for any stretch where mixing occurs. I guess if the multiplier is about half of the genome this is “O(1)” but then it becomes meaningless. I must be missing something.

  29. Scott Says:

    Markk: firstly, we’re talking about the entire gene pool, not about a specific organism.

    The point is this: suppose there are n mating pairs each with k offspring, and suppose for simplicity that the population is in a steady state. Then of the kn offspring, 2n are going to survive. So, in a certain information-theoretic sense, the number of bit-errors in the gene pool that natural selection can correct in one generation will be

    log2(kn / 2n) = log2(k) – 1.

    For example, suppose there are t bits in the genome that should be set to 0, but that (having mutated) are now completely random from one organism to the next. Then natural selection, by “choosing” 2n out of the kn organisms to survive, can only fix this problem if

    t ≤ log2(k) – 1.

    In reality, of course, it won’t be the same t bits that mutate in every organism. But one can show that that doesn’t affect the conclusion.
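
    A one-generation toy simulation of this point (my own illustration, not anything from Eliezer’s post): each of n pairs has k offspring, every offspring carries t bits that should be 0 but are uniformly random, and selection keeps the 2n offspring with the fewest errors. Error-free survivors fill all the slots only while t is at most about log2(k) − 1; beyond that, errors start surviving into the next generation.

```python
import random

def mean_surviving_errors(n, k, t, rng):
    """n mating pairs, k offspring each; every offspring has t 'should-be-0'
    bits set uniformly at random; keep the 2n offspring with the fewest errors
    and return their average error count."""
    offspring = [sum(rng.random() < 0.5 for _ in range(t)) for _ in range(k * n)]
    return sum(sorted(offspring)[:2 * n]) / (2 * n)

rng = random.Random(0)
n, k = 500, 16                     # log2(k) - 1 = 3
for t in (1, 2, 3, 4, 6, 10, 20):
    print(t, round(mean_surviving_errors(n, k, t, rng), 2))
# stays near 0 up to t ~ 3, then grows steadily with t
```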

  30. Andy Says:

    I believe I can show that, if we’re producing a string x of length n, we can keep the intermediate strings to length at most 2n, while incurring at most a factor-2 blowup in the number of steps used.

    As Scott suggested to me in conversation, let’s keep track of which symbols in the intermediate workstring will and won’t make it into the final string x without deletion. If they won’t, call them ‘nonterminals’. (If a symbol will be replicated, and at least one of its descendants will be included in the final string, it’s not considered a nonterminal.)

    First note that, if in a production of x there appears at some stage a block of consecutive nonterminals, we can modify the production so that they are treated as a ‘unit’ in subsequent steps, i.e. no string operation takes an endpoint that’s properly inside the block. They get shuffled around together and eventually destroyed together. The production’s length is not affected.

    In a stage of such a production, for analysis purposes, we call a maximal block of nonterminals a ‘garbage monster’. Each time a new garbage monster is created (and many may be created by a single copy operation–even if there were no nonterminals in the original substring), we give it an unused, goofy monster name.

    When string manipulations cause two garbage monsters to become adjacent, we combine them under one of their two names (and say that that one ‘eats’ the other one).

    Key claim: in our modified type of production, at most one garbage monster gets eaten or destroyed at every step! Draw some pictures to see if you agree.

    Now, with this fact in hand, we can modify the production again and argue it gives what we want.
    Here’s the strategy: given a production of x of the type we described, modify the production so that between each step, we stop and delete all garbage monsters, one deletion per monster.

    This is a bit wasteful since some of these might’ve been eaten by other garbage monsters; but if our production had k steps to begin with, only

  31. Andy Says:

    …only at most k monsters got eaten total, and all garbage monsters must die somehow, so we only add at most k new steps.

    Finally, observe that the string is culled to length at most n after every new round of ‘executions’, and since the workstring’s length is at most doubled by any string operation, its length is bounded by 2n throughout.

  32. Job Says:

    I think that we would be able to show that we can’t do better than n/log n even when allowing for exponential-sized intermediate strings, if we can show equivalence between a model that allows for exponential-sized intermediate strings with “complete addressing” and one that allows for exponential-sized intermediate strings with “incomplete addressing”.
    By “complete addressing” i mean the ability to act on any given index, and by “incomplete addressing” i mean the ability to act on at most a polynomial number of pre-specified indexes.

  33. David Speyer Says:

    I’ll put my questions related to Scott’s problem here and my questions related to Eliezer’s original post at Overcoming Bias:

    (1) It seems to me that the Ergün, Muthukrishnan, and Sahinalp paper also makes a testable prediction. Measure the genetic difference between two organisms by building a Lempel-Ziv dictionary from the first organism’s DNA and then seeing how many additional entries are added when we compress the second organism’s DNA. The prediction is that this difference always grows only linearly with time. If we ever observe this difference starting to jump up rapidly, we will know that our model of how mutation occurs is too simplistic. Am I missing anything?

    (2) It would be nice to have a precise way to distinguish between “local” and “global” error correction mechanisms. Is there some sort of bound out there which says that, for example, if the state of each output bit only depends on k input bits, then we can’t do error correction better than some rate?
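
    Regarding (1), here is one way to implement that distance (my sketch; I'm reading the "Lempel-Ziv dictionary" in the LZ77/Ergün-et-al. sense, where phrases of the second genome may be copied from the first genome or from what has been parsed so far; an LZ78-style dictionary would give a similar but not identical measure):

```python
import random

def lz_cross_phrases(b, a):
    """Number of phrases in a greedy parse of genome b, where each phrase is the
    longest prefix of the unparsed remainder that occurs in genome a or in the
    already-parsed part of b, plus one literal character."""
    reference = a + "\x00"        # separator, so no match straddles a and b
    i, phrases = 0, 0
    while i < len(b):
        length = 0
        while i + length < len(b) and b[i:i + length + 1] in reference + b[:i]:
            length += 1
        i += length + 1
        phrases += 1
    return phrases

random.seed(0)
A = "".join(random.choice("ACGT") for _ in range(2000))
B = A[:1500] + "".join(random.choice("ACGT") for _ in range(500))   # a close relative of A
C = "".join(random.choice("ACGT") for _ in range(2000))             # unrelated
print(lz_cross_phrases(B, A), lz_cross_phrases(C, A))               # the relative needs far fewer phrases
```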

  34. Andy Says:

    The above proof, granted it’s correct, also implies an Omega(n/lg n) bound for producing most strings. To see this (an implication implicitly observed by Scott #12), note that we can encode any string by its method of production, and if the intermediate strings are all of length O(poly(n)), each step can be encoded with O(log n) bits. Since we’ve shown how to transform productions to ensure this length bound with a constant-factor step increase, a string with a k-step description has a description of length O(k log n).

    Since most length-n strings have descriptive (‘Kolmogorov’) complexity ~ n, we must have k = Omega (n / lg n) for most strings.
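
    Spelling out the counting step (my paraphrase of the argument):

```latex
% Each operation names O(1) endpoints of O(log n) bits each, so a k-step
% production with poly(n)-length intermediates can be written down in
% O(k log n) bits. For a string x of near-maximal Kolmogorov complexity,
% K(x) >= n - O(1), which holds for most strings, this forces
\[
  c\,k\log n \;\ge\; K(x) \;\ge\; n - O(1)
  \quad\Longrightarrow\quad
  k \;=\; \Omega\!\left(\frac{n}{\log n}\right).
\]
```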

  35. Andy Says:

    Here’s another question: supposing Nature wants to be able to perform global error-correction on chunks of DNA, but has only these string-rewrite tools available, how efficiently can it do so?

    Formally, I’m asking for an instantiation of a good error-correcting code C from n-bit strings to O(n)-bit strings, capable of unique recovery from Omega(n) adversarial errors, where for all x, the transformation from x to C(x) can be performed by relatively few applications of string-rewrite operations (1)-(4)… hopefully o(n).

    (It seems clearly too much to ask for the decoding process to be that efficient, but encoding I’m not so sure.)

  36. Andy Says:

    David–there are linear codes, e.g. ‘Sipser-Spielman expander codes’, that achieve good parameters while each output bit depends on only O(1) input bits. In fact, these input bits can be randomly selected for each output, and you’ll still get a good code.

    Maybe there are some meaningful bounds on coding in this setting, but it’s not highly restrictive.

  37. David Speyer Says:

    I see. In that case, why do we think that the error-correcting codes Nature uses aren’t up to the task?

  38. oz Says:

    Not sure if it’s relevant, but when aligning the genomes of several mammals, and looking at the size of regions which are conserved (i.e. they look more similar between different mammals than they would have been had they evolved neutrally), one finds that about 5% of the human genome is conserved, and thus believed to be ‘functional’, or ‘meaningful’.
    This gives 150MB which is more than 25MB but maybe in the same ball-park and can be achieved if you play with some parameters.

  39. Andy Says:

    David–I don’t think there’s any a priori reason Nature can’t do this, but as Scott says it’s not known to occur and there are examples where small mutations are catastrophic.

    Linear codes also use addition modulo 2 (or p), which is certainly physically realizable in many settings but is not part of the toolbox Scott posited for DNA manipulation. My comment 35 (which I guess reiterates part of what Scott was asking originally) is aimed at asking how efficient error-correction is possible in this setting. But I’m fairly agnostic/ignorant as to how apt the model we’re using actually is to describe the constraints on nature. For one thing, parallel time might be a better resource to track since cells have so many active components.

  40. Andy Says:

    Also, I haven’t read Eliezer’s post yet, but from what others say he may be focusing on ‘error correction’ as performed by natural selection on populations, as opposed to error correction within an individual cell/organism.

  41. Andy D Says:

    I’ve found a negative answer to the question in comment 35: if a cell wants to transform string x (of length n) into string C(x), where C(x) is any good error-correcting code, Omega(n/lg n) string operations are necessary. This is the case even if the rewriting system can look at x and determine its rewrite operations accordingly, via any rule/algorithm.

    A pessimistic finding, since C(x) can be constructed without even manipulating x in O(n/lg n) steps, using Scott’s ideas from comment 14. So DNA rewriting, as we’ve construed it here, seems weak in a number of ways.

    The writeup was somewhat longer this time, so I put it on my blog:

    http://andysresearch.blogspot.com/2007/11/goofball-dna-bonanza.html

  42. Markk Says:

    Re: the O(1) errors corrected per generation stuff.

    Fun stuff.

    OK, I see what you’re saying, but to me this doesn’t make the case for a limit on the size of genomes subject to replication errors. It seems to assume random survival. If the errors are bad enough, it doesn’t matter that they occur – there won’t be any offspring carrying them.

    In the example you gave, if the “1” genes are really bad, ALL of the offspring that have them will die before birth (say), and be part of the Kn-2n deaths. Thus they will ALL be corrected in one generation. That is, no matter what the transcription error rate, you will never find reproductive-age individuals with anything other than zeros. It’s a perfect error-correcting code. If they can be carried along generation upon generation, then why are you calling them “errors”? If the error rate is too high you will find no individuals of any type, of course, but that would be determined not just by the error rate and the size of the genome, as the paper seems to conclude, but also by the size of “K”, the population to be corrected.

    That’s what bugs me about the whole thing, actually. Transcription “errors”, i.e. mutations, are just one thing that makes changes in the genome, and a small thing at that. In sexual reproduction, mixing creates many more changes to the whole genome every generation than mutations do, and selection is more powerful than any error-correcting code in that the information used to do the correcting is not “inside the system”, so to speak. It is encoded in the environment. DNA doesn’t have to have super correction built in (although it does have some); it just has to get mixed around enough, and have enough varieties produced, to still have some organisms left when the pruning is done.

    I think there is a point to be made that reproduction and mixing rates, along with duplication of sequences (which is all over the place, and is in itself a kind of error correction, since transcription could be happening at several of these sites at once), must be high enough to overcome harmful transcription errors, but there is no hard limit – it would be some function of size of population, number of possible offspring, mutation rate, mixing rate, etc. I am fairly sure that this has already been worked out, by population-genetics folks, Fisher or whoever, generations ago.

    Whoops I was going to do an example showing that there is an exponentially small chance that all errors could be corrected by mixing in one generation also, but this post is too long already.

  43. Scott Says:

    Markk: No, you still don’t get it. Natural selection isn’t being ignored; on the contrary, it’s the only thing that lets us cope with any nonzero error rate. The point is that, if the error rate exceeds the natural selection rate, then not even natural selection can cope with the errors without killing off the whole population. That is, if every mating pair has 2k offspring, but of those fewer than 2 are free from crippling mutations, then the population will become smaller and smaller until it eventually disappears.

    Crossover doesn’t substantially change the conclusion. For one thing, almost all of the genome is identical across almost all of the population (you know that warm-fuzzy stuff about the whole human race being 99.8% genetically identical 🙂 ) So the “main” function of natural selection, for most base pairs most of the time, is not to innovate (say, by recombining genes in interesting ways) but just to protect against random errors. (This is something I learned just a couple days ago from Eliezer’s post, but is so obvious in retrospect I feel like I’ve known it my whole life. 🙂 )

  44. John Sidles Says:

    Since this thread is cooling down, maybe this is a good time to mention that Ed Wilson’s scientific autobiography Naturalist provides a wonderfully broad context for appreciating many of these information-theory issues.

    I heartily recommend this biography to anyone starting an academic career, in any discipline. In fact, it is pretty clear that this is the audience that Wilson was writing for.

  45. David Speyer Says:

    Markk,

    I think I finally understand the issue, so let me try to explain it to you. Disclaimer — I am a mathematician, not a biologist. There are still several points that I find unclear; I’ll describe them in a follow-up post. Suppose that you have a herd of P cows, each of which has N base pairs of non-junk DNA. A huge proportion of the non-junk DNA will be identical from one cow to the next, so there will be no diversity introduced there by recombination. Each mating pair produces an average of 2k offspring (so there are kP calves in total.) In each of these offspring, the probability that a base pair is incorrectly copied is e, which is very small. You are the farmer/natural selection/God. Your job is to keep the herd as stable as possible. But the only way you can do this is to kill off the bad copies; you can’t actually go in and fix the DNA. And you can’t kill off too many, or the population will decline. So, in each generation, you keep the best P offspring, out of the kP options available to you.

    Now, here is the key point. Suppose that the probability that a calf is better than its parents is less than 1/k. Then you will lose. If less than 1/k of the offspring have errors that improve their fitness, then you will be forced to keep calves that were less fit than the preceding generation. Of course, you won’t keep losing forever. As the herd gets less healthy, the likelihood that an error will improve the herd increases. There will be an equilibrium point, where the probability of an improved copy is 1/k.

    Note: I’m simplifying here; I’m acting as if the herd has one given level of fitness. (Actually, my assumption is a little weaker than that; I am assuming that the fitness of the herd is clustered tightly enough around a single value that we may assume that the probability of improvement is independent of the level of fitness.) Really, there is a range of fitnesses, and we have to solve for an entire equilibrium distribution. That makes the math much harder. If you believe that this is the key to the issue, I’m willing to tackle the harder problem, but please try to understand the simplified version first.

    Now, in fact, we know that life is very good at this task. There is almost no DNA variation within the non-junk DNA of a species. I claim that this means eN must be bounded. The exact bound depends on k, but not on N. Here is why. The expected number of new errors introduced by a given copying is eN. (More precisely, it is e(N-2B), where B is the number of errors already present, but B is much less than N.) In particular, this number is positive. The standard deviation of the number of errors introduced is (e (1-e) N)^{1/2}. So, in order for the copy to turn out better than the original, we need an event to occur which is on the order of (eN)^{1/2} standard deviations away from the norm. And we want this event to occur at least 1/k of the time. By the Chebyshev bounds this means that eN is O(k). (You should be able to do better than the Chebyshev bounds by analyzing the actual random process involved, but that will only change the constants and I’m lazy.)
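
    In symbols, the Chebyshev step being invoked looks roughly like this (my reconstruction, using the same simplification that B is much smaller than N):

```latex
% Expected net change in error count per copying, and its spread:
\[
  \mathbb{E}[\Delta] \approx eN, \qquad
  \sigma(\Delta) \approx \sqrt{e(1-e)N}.
\]
% An offspring improves on its parent only if \Delta < 0, a deviation of about
% eN \approx \sqrt{eN}\,\sigma below the mean, so by Chebyshev
\[
  \Pr[\Delta < 0] \;\le\; \frac{1-e}{eN} \;\approx\; \frac{1}{eN},
\]
% and requiring an improvement probability of at least 1/k at equilibrium
% gives eN = O(k).
```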

  46. David Speyer Says:

    Now, a few things which still bother me.

    (1) Do we really know that nature can’t go in and fix the DNA? This strikes me as the biggest vulnerability here. If there were stages in the cell duplication process which fixed errors, this would seem to me to destroy the whole argument. And I don’t see (maybe Scott can explain this) why you would need sophisticated error correcting codes to do this; I think even a very crude encoding would work so long as the errors were actually undone and not merely compensated for.

    (2) As a number of commenters have pointed out, natural selection does not only kill off unfit offspring, it prevents them from being conceived in the first place through sexual selection. We can adjust the argument by saying that only the best rP of the herd get to breed (for some r<1). These rP have kP offspring, of which the best P survive. I haven’t checked this, but my intuition is that we now get the bound eN=O(k/r) instead of eN=O(k). So eN is still bounded, but the bound might be significantly larger. Has anyone thought this through?

    (3) This is pure laziness on my part. Can anyone point me to numbers regarding how well DNA which is considered to be non-functional is conserved within a species? In a quick search, I found studies showing that certain regions, for which no function is known, are surprisingly well conserved, but I couldn’t find out what would be a non-surprising level of conservation.

  47. David Speyer Says:

    Oh gosh, my second comment contained < symbols, which breaks wordpress. Scott, if you could please clean this up, I’d appreciate it. In the mean time, here is how the final two paragraphs should have read:

    (2) As a number of commenters have pointed out, natural selection does not only kill off unfit offspring, it prevents them from being conceived in the first place through sexual selection. We can adjust the argument by saying that only the best rP of the herd get to breed (for some r<1). These rP have kP offspring, of which the best P survive. I haven’t checked this, but my intuition is that we now get the bound eN=O(k/r) instead of eN=O(k). So eN is still bounded, but the bound might be significantly larger. Has anyone thought this through?

    (3) This is pure laziness on my part. Can anyone point me to numbers regarding how well DNA which is considered to be non-functional is conserved within a species? In a quick search, I found
    studies showing that certain regions, for which no function is known, are surprisingly well conserved but I couldn’t find out what would be a non-surprising level of conservation.

  48. Torbjörn Larsson, OM Says:

    one finds that about 5% of the human genome is conserved, and thus believed to be ‘functional’, or ‘meaningful’. This gives 150MB which is more than 25MB but maybe in the same ball-park and can be achieved if you play with some parameters.

    My understanding (albeit IANAB) is that it is roughly conserved, so near-neutral variation will be included. (For example, synonymous SNPs encoding the same AA.)

    Also, as it is conservation with respect to functionality, it will probably include drift from such sources; it isn’t necessarily all fixed variation of the kind Yudkowsky seems to be discussing.

    These posts touch some interesting ideas. Some variation is needed to drive evolution and fixate traits, and I assume the copy error rate is itself selected for by average environment change. But AFAIU the given limit of “corrected” DNA (DNA under selective pressure) isn’t the same as “meaningful” base pairs.

    Dawkins makes an analysis of what “information” means for evolution (as an answer to creationists’ non-sequitur). Information is a relative property of a system, but by choosing a measure he finds that, with respect to selection, information is fed into the gene pool of the next generation.

    Specifically it is information of how to survive:

    If natural selection feeds information into gene pools, what is the information about? It is about how to survive. Strictly it is about how to survive and reproduce, in the conditions that prevailed when previous generations were alive.

    To the extent that present day conditions are different from ancestral conditions, the ancestral genetic advice will be wrong. In extreme cases, the species may then go extinct. To the extent that conditions for the present generation are not too different from conditions for past generations, the information fed into present-day genomes from past generations is helpful information. [Italics original, bold added.]

    So the genome contains more or less meaningful remnants of earlier environments where the selective pressures were different. Presumably some of this gained information is locked in, like our body plans, by current selective pressures. However, I assume that, due to interdependencies, not all of the locked-in information is directly maintained by Yudkowsky’s limit.

  49. Torbjörn Larsson, OM Says:

    Can anyone point me to numbers regarding how well DNA which is considered to be non-functional is conserved within a species?

    Depends on what you mean with function and conserved.

    IIRC, TR Gregory of the Genomicron blog, an evolutionary biologist studying genome size, makes the point that you must specify functionality. Since genome size decides cell size, you can possibly find such a function for all DNA.

    Also, eukaryotes have regulatory sites up- and downstream of coding regions, and their positions are more or less preserved. And I believe that is also the case for eukaryotic introns, non-coding regions that are later excised.

    Btw, as I understand it, introns make for inefficient transcription and translation (taking time), and organisms that for some reason or other are evolving rapidly (read: fruit flies) seem to get rid of introns faster than they occur. So their “function” isn’t protecting them from evolution.

  50. Job (UG) Says:

    If we’re given a minimal sequence of K insert/delete/copy operations generating a string S, then what does that say about how fast we can implement f(i), a function that returns the ith character of S, both on a machine model M1 that supports insert/delete-range/copy-range operations and on a TM?