{"id":287,"date":"2007-11-04T16:50:24","date_gmt":"2007-11-04T20:50:24","guid":{"rendered":"https:\/\/scottaaronson.blog\/?p=287"},"modified":"2007-11-04T16:50:24","modified_gmt":"2007-11-04T20:50:24","slug":"deoxyribononapproximability","status":"publish","type":"post","link":"https:\/\/scottaaronson.blog\/?p=287","title":{"rendered":"Deoxyribononapproximability"},"content":{"rendered":"<p><a href=\"https:\/\/scottaaronson.blog\/#114\">[Updates]<\/a><\/p>\n<p>Alright, here&#8217;s a problem for all you bioinformatistas and inapproximabistas out there, which was inspired by <a href=\"http:\/\/www.overcomingbias.com\/2007\/11\/natural-selecti.html\">this post of Eliezer Yudkowsky<\/a> at <a href=\"http:\/\/www.overcomingbias.com\">Overcoming Bias<\/a> (see also the comments there).<\/p>\n<p>Let a DNA sequence be an element of {A,C,G,T}*, and suppose we&#8217;re allowed the following primitive operations: (1) insert a base pair anywhere we want, (2) delete any substring, (3) reverse any substring, and (4) copy any substring into any other part of the string.  Then given a DNA sequence S, how hard is it to estimate the minimum number of operations needed to produce S starting from the empty string?<\/p>\n<p>Closely related is the following problem: by starting from the empty string and applying o(n) operations, can we produce a &#8220;pseudorandom DNA sequence&#8221; of length n &#8212; that is, a sequence that can&#8217;t be distinguished in polynomial time from a uniform random one?<\/p>\n<p><font size=\"-1\">(<em>Note 1:<\/em> For both problems, we might also want to stipulate that every intermediate sequence should have size at most polynomial in n.  
Or better yet, maybe one can prove that such an assumption is without loss of generality.)<\/font><\/p>\n<p><font size=\"-1\">(<em>Note 2:<\/em> I&#8217;m also very interested in what happens if we disallow the powerful operation of reversal.)<\/font><\/p>\n<p>For all I know, these problems might have trivial (or at any rate, known) answers; I just came up with them and haven&#8217;t thought them through.<\/p>\n<p>What the problems are <em>really<\/em> getting at is this: is the &#8220;effective number of bits&#8221; in your genome (that is, the number of bits from a polynomial-time algorithm&#8217;s perspective) limited by how many ancestors you&#8217;ve had since life on Earth began?  Or can it be vastly greater?<\/p>\n<p><strong><font color=\"red\"><a title=\"114\" name=\"114\"><\/a>Update (11\/4):<\/font><\/strong>  Rereading the last few paragraphs of Eliezer&#8217;s post, I see that he actually argues for his central claim &#8212; that the human genome can&#8217;t contain more than 25MB of &#8220;meaningful DNA&#8221; &#8212; on different (and much stronger) grounds than I thought!  My apologies for not reading more carefully.<\/p>\n<p>In particular, the argument has nothing to do with the number of generations since the dawn of time, and instead deals with the maximum number of DNA bases that can be simultaneously protected, <em>in steady state<\/em>, against copying errors.  According to Eliezer, copying a DNA sequence involves a ~10<sup>-8<\/sup> probability of error per base pair, which &#8212; because only O(1) errors per generation can be corrected by natural selection &#8212; yields an upper bound of ~10<sup>8<\/sup> on the number of &#8220;meaningful&#8221; base pairs in any given genome.<\/p>\n<p>However, while this argument is much better than my straw-man based on the number of generations, there&#8217;s still an interesting loophole.  
Even with a 10<sup>-8<\/sup> chance of copying errors, one could imagine a genome reliably encoding far more than 10<sup>8<\/sup> bits (in fact, arbitrarily many bits) by using an error-correcting code.  I&#8217;m not talking about the &#8220;local&#8221; error-correction mechanisms that we know DNA has, but about something more global &#8212; by which, say, copying errors in any small set of genes could be completely compensated by other genes.   The interesting question is whether natural selection could read the syndrome of such a code, and then correct it, using O(1) randomly-chosen insertions, deletions, transpositions, and reversals.  I admit that this seems unlikely, and that even if it&#8217;s possible in principle, it&#8217;s probably irrelevant to real biology.  For apparently there are examples where changing even a single base pair leads to horrible mutations.  And on top of that, we can&#8217;t have the error-correcting code be <em>too<\/em> good, since otherwise we&#8217;ll suppress beneficial mutations!<\/p>\n<p>Incidentally, Eliezer&#8217;s argument makes the falsifiable prediction that we shouldn&#8217;t find <em>any<\/em> organism, <em>anywhere<\/em> in nature, with more than 25MB of functional DNA.  Does anyone know of a candidate counterexample?  (I know there are organisms with far more than humans&#8217; 3 billion base pairs, but I have no idea how many of the base pairs are functional.)<\/p>\n<p>Lastly, in spite of everything above, I&#8217;d still like a solution to my &#8220;pseudorandom DNA sequence&#8221; problem.  For <em>if<\/em> the answer were negative &#8212; if given any DNA sequence, one could efficiently reconstruct a nearly-optimal sequence of insertions, transpositions, etc. 
producing it &#8212; then even my original straw-man misconstrual of Eliezer&#8217;s argument could put up a decent fight!<\/p>\n<p><strong><font color=\"red\">Update (11\/5):<\/font><\/strong> Piotr Indyk pointed me to a <a href=\"http:\/\/www.cs.sfu.ca\/~funda\/PUBLICATIONS\/fsttcs.ps\">paper<\/a> by Erg\u00fcn, Muthukrishnan, and Sahinalp from FSTTCS&#8217;2003, which basically solves my problem in the special case of no reversals.  It turns out that you can estimate the number of insert, delete, and copy operations needed to produce a given DNA sequence to within a factor of 4, by just applying Lempel-Ziv compression to the sequence.  Thanks, Piotr!<\/p>\n<p><strong><font color=\"red\">Another Update (11\/5):<\/font><\/strong> Andy Drucker has pointed out that, in the case where reversals <em>are<\/em> allowed, we can approximate the number of insert, delete, copy, and reverse operations needed to produce a given DNA sequence to within a factor of 16, by combining the Lempel-Ziv approach of Erg\u00fcn et al. with a clever trick: maintain both the sequence and its reversal at all times!  Interestingly, though, this trick <em>doesn&#8217;t<\/em> seem to work for transforming one sequence into another (a more general problem than I asked about, and the one considered by Erg\u00fcn et al.).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[Updates] Alright, here&#8217;s a problem for all you bioinformatistas and inapproximabistas out there, which was inspired by this post of Eliezer Yudkowsky at Overcoming Bias (see also the comments there). 
Let a DNA sequence be an element of {A,C,G,T}*, and suppose we&#8217;re allowed the following primitive operations: (1) insert a base pair anywhere we want, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_wpas_customize_per_network":false},"categories":[5,30],"tags":[],"class_list":["post-287","post","type-post","status-publish","format-standard","hentry","category-complexity","category-mirrored-on-csail-blog"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=\/wp\/v2\/posts\/287","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=287"}],"version-history":[{"count":0,"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=\/wp\/v2\/posts\/287\/revisions"}],"wp:attachment":[{"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=287"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=287"},{"taxonomy":"post_tag"
,"embeddable":true,"href":"https:\/\/scottaaronson.blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=287"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}