Testing GPT-4 with math plugins

A couple nights ago Ernie Davis and I put out a paper entitled Testing GPT-4 on Wolfram Alpha and Code Interpreter plug-ins on math and science problems. Following on our DALL-E paper with Gary Marcus, this was another “adversarial collaboration” between me and Ernie. I’m on leave to work for OpenAI, and have been extremely excited by the near-term applications of LLMs, while Ernie has often been skeptical of OpenAI’s claims, but we both want to test our preconceptions against reality. As I recently remarked to Ernie, we both see the same glass; it’s just that he mostly focuses on the empty half, whereas I remember how fantastical even a drop of water in this glass would’ve seemed to me just a few years ago, and therefore focus more on the half that’s full.

Anyway, here are a few examples of the questions I posed to GPT-4, with the recent plug-ins that enhance its calculation abilities:

If you fell into the black hole at the center of the Milky Way, how long would you have before hitting the singularity? [You’d have about a minute]

Approximately how much time would a commercial airliner save in going from New York to Tel Aviv, if it could go in a straight line, through a tunnel in the earth, at the same speed as usual? [I was on such a flight when I wrote this question, and must’ve been bored and impatient. The answer is ~50 minutes.]

Approximately how long would it take to transmit an entire human genome over a standard WiFi connection? [About 4 minutes, assuming no compression and a 25Mbps connection]

How does the total weight of all the uranium that humans have mined compare to the total weight of all the gold that they’ve mined? [About 13 times as much uranium]

Approximately how many errors will a standard laptop suffer over its lifetime, due to cosmic rays hitting the microchip? [Estimates vary widely, but maybe 2000]

What is the approximate probability that a randomly-chosen 100-digit integer is prime? [About 0.4%]
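
A couple of these estimates are easy to sanity-check in a few lines of Python. Here's a back-of-the-envelope sketch, which assumes a roughly 3.1-billion-base-pair genome stored at 2 bits per base with no compression, a 25 Mbps link, and the prime number theorem's ~1/ln(N) density of primes; it reproduces the bracketed answers for the genome and prime questions:

import math

# Genome over WiFi: ~3.1 billion bases at 2 bits per base, 25 Mbps, no compression
bases = 3.1e9
seconds = (2 * bases) / 25e6
print(f"genome transfer: ~{seconds / 60:.1f} minutes")        # ~4.1 minutes

# Probability that a random 100-digit integer is prime:
# by the prime number theorem, the density of primes near N is ~1/ln(N)
p = 1 / math.log(10**100)
print(f"P(random 100-digit integer is prime): ~{p:.2%}")      # ~0.43%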

GPT-4 with plug-ins did very well on all of the questions above. Here, by contrast, is a question where it did poorly:

Assume that IQs are normally distributed, with a mean of 100 and a standard deviation of 15. For what n is there the maximum excess of people with an IQ of n over people with an IQ of n+1?

GPT-4 thought that there were two solutions, n~85 and n~115, rather than just a single solution (n~115).
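
If you want to check this, here's a minimal numerical sketch, assuming the stated N(100, 15²) distribution: the difference f(n) − f(n+1) has a single maximum near n ≈ 114.5 (consistent with the n~115 above), while n ≈ 85 is roughly where f(n+1) exceeds f(n) by the most, i.e. a minimum of the difference rather than a second maximum.

import numpy as np
from scipy.stats import norm

f = norm(loc=100, scale=15).pdf            # IQ density
n = np.linspace(50, 150, 200001)
diff = f(n) - f(n + 1)
print(n[np.argmax(diff)])   # ~114.5: the unique maximum of f(n) - f(n+1)
print(n[np.argmin(diff)])   # ~84.5: where f(n+1) exceeds f(n) the most (a minimum, not a second maximum)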

Ernie, for his part, was more a fan of “pure pain” problems like the following:

A quantity of chlorine gas is in a right prism whose base is a triangle with sides 5cm, 7cm, and 4cm and whose altitude is 8cm. The temperature is the freezing point of mercury, and the pressure is 2 atmospheres. What is the mass of the chlorine?
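
For reference, here's what the calculation presumably looks like, treating the chlorine as an ideal gas at the stated conditions (physically questionable at that temperature and pressure, but clearly what the problem intends); it comes out to roughly 0.6 grams:

import math

a, b, c = 5, 7, 4                                        # triangle sides, cm
s = (a + b + c) / 2
base_area = math.sqrt(s * (s - a) * (s - b) * (s - c))   # Heron's formula, ~9.80 cm^2
volume_L = base_area * 8 / 1000                          # altitude 8 cm; cm^3 -> liters

T = 234.3                                                # freezing point of mercury, in kelvin
P = 2.0                                                  # atm
R = 0.082057                                             # L·atm/(mol·K)
moles = P * volume_L / (R * T)                           # ideal gas law, n = PV/RT
print(moles * 70.9)                                      # molar mass of Cl2 ~ 70.9 g/mol -> ~0.58 g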

GPT-4 actually aced the above problem. But it failed the majority of Ernie’s other problems, such as:

Viewed from Vega, what is the angle between Sirius and the Sun? [The answer is about 5.6 degrees. GPT thought, implausibly, that it was just 0.005 degrees, or that the answer would vary depending on the time of day.]
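
The ~5.6° figure is just a three-dimensional law-of-cosines calculation. Here's a rough check, using approximate coordinates and distances (Vega at roughly 25 light-years, Sirius at roughly 8.6 light-years; all the specific numbers below are approximations):

import numpy as np

def unit(ra_deg, dec_deg):
    """Unit vector toward the given right ascension and declination."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.array([np.cos(dec) * np.cos(ra), np.cos(dec) * np.sin(ra), np.sin(dec)])

# Approximate positions in light-years, with the Sun at the origin
vega   = 25.0 * unit(279.2, 38.8)
sirius = 8.6 * unit(101.3, -16.7)

to_sun    = -vega                      # direction from Vega to the Sun
to_sirius = sirius - vega              # direction from Vega to Sirius
cos_angle = to_sun @ to_sirius / (np.linalg.norm(to_sun) * np.linalg.norm(to_sirius))
print(np.degrees(np.arccos(cos_angle)))   # ~5.6 degrees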

My personal favorite among Ernie’s problems was this one:

A physical process generates photons whose energies follow a random distribution of the following form: For positive energy e, the probability density at e is proportional to the value of e in a Gaussian distribution with mean 2 eV and standard deviation 0.01 eV. The probability of a negative value is zero. What is the expected value of the wavelength of a photon produced by this process? (Give the mathematical answer, assuming that the above description is exact, and assuming the standard relation between energy and wavelength in a photon. The answer is not physically plausible.)

The answer, in case you’re wondering, is “infinity.” On this problem, GPT-4 set up the integral perfectly correctly, then correctly fed it to WolframAlpha. But on getting the result, it apologized that “something went wrong,” it must’ve made a mistake, the integral seemed not to be converging, and there was a singularity at E=0 that would have to be dealt with by a change of variables. So it tried again. And again. And again. Each time, it got the same “mistaken” result, and each time it profusely apologized. Despite the explicit wording of the problem, GPT-4 never considered the possibility that the human would be so ridiculous as to give it a physics problem with an infinite answer.
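
For the curious, the divergence is easy to see once the integral is written down. With \( \lambda = hc/E \), and the density proportional to the Gaussian restricted to positive energies,

$$ \mathbb{E}[\lambda] \;\propto\; \int_0^\infty \frac{hc}{e} \, \exp\!\left( -\frac{(e-2)^2}{2 \cdot 0.01^2} \right) de. $$

As \( e \to 0^+ \), the Gaussian factor tends to the astronomically small but strictly positive constant \( \exp(-2\times 10^4) \), so near zero the integrand behaves like a positive constant times \( 1/e \), whose integral diverges (logarithmically). Hence the expected wavelength really is infinite, which is exactly the result that GPT-4 kept apologizing for.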

Anyway, what did we learn from this exercise?

  • GPT-4 remains an endlessly enthusiastic B/B+ student in math, physics, and any other STEM field. By using the Code Interpreter or WolframAlpha plugins, it can correctly solve difficult word problems, involving a combination of tedious calculations, world knowledge, and conceptual understanding, maybe a third of the time—a rate that’s not good enough to be relied on, but is utterly astounding compared to where AI was just a few years ago.
  • GPT-4 now clearly does better at calculation-heavy STEM problems with the plugins than it could without them.
  • We didn’t find either the WolframAlpha or the Code Interpreter plugin to be clearly superior to the other. It’s possible that they’re incomparable, each good for different things.
  • When GPT-4 screwed up, it was often due to a “poor interface” between the language model and the plug-in—e.g. the model having no idea what call to make or how to recover when a call returned an error. Enormous gains seem to be possible by improving these interfaces.
  • Sometimes, much like humans I’ve known, GPT-4 would do amazingly well at a difficult computation, then fumble a trivial final step (e.g., converting the answer into the requested units). Just as I would with human students, I advocated for generous partial credit in such cases.
  • I conjecture, although I don’t have empirical data to show this, that GPT-4 with math plug-ins used in “interactive mode”—with a human reformulating and clarifying the problems as needed, feeding ideas, checking the answers for plausibility, pointing out errors, etc.—could currently achieve excellent accuracy on these sorts of problems, and do so faster than either GPT-4 with math plug-ins alone or all but the very best humans working alone.

41 Responses to “Testing GPT-4 with math plugins”

  1. xpil Says:

    It took me the better part of a day of experimenting with various prompts until I finally got GPT-4 (with Wolfram plugin) to give me a correct solution to the following problem:

    Write an equation of a line that passes through the origin and is tangent to a circle centered at (x1, y1) with radius r.

    It feels smart in some areas but mathematics, even with plugins, is still a weak spot.
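
    For reference, the condition involved is easy to state: the line y = m·x through the origin is tangent to the circle exactly when the distance from the center (x1, y1) to the line equals r. A quick sympy sketch of that condition (symbol names are just for illustration; the vertical line x = 0 has to be checked separately):

    from sympy import symbols, Eq, solve

    m, x1, y1, r = symbols('m x1 y1 r', real=True)
    # Distance from (x1, y1) to the line y = m*x equals r (both sides squared)
    tangency = Eq((m*x1 - y1)**2, r**2 * (m**2 + 1))
    print(solve(tangency, m))   # generally two tangent slopes, when the origin lies outside the circle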

  2. Stephen Jordan Says:

    That’s really cool. Was it useful enough that you plan to use it to assist in your work?

  3. James Says:

    Interesting experiments! I don’t understand the normally-distributed IQ question. Maybe I’m missing something. For a normal distribution, the ratio of the density at n to the density at n+1 goes to infinity as n goes to infinity. It’s not maximised at mu + sigma. Perhaps we need to make some adjustment since we are discretising a continuous distribution. But that doesn’t seem to change the behaviour in a relevant way.

    Is the question meant to ask about the value of n which maximises the *difference* between the number of people with IQ n and the number of people with IQ n+1? Then indeed the inflection points of the density function, occurring at mu – sigma, mu + sigma, are relevant.

  4. Timothy Chow Says:

    I expect that the performance of these kinds of systems will really take off once we figure out how to get an interactive theorem prover (Lean, Coq, Isabelle, Mizar, Metamath Zero, …) into the mix. If all known math were formalized in one or more of these systems, and a large neural net were trained on it, I suspect that it would be amazingly good at generating almost-proofs of sophisticated theorems. These almost-proofs might then be upgraded to real proofs via a conversation with the interactive theorem prover. In principle, this approach is much more powerful than using something like Wolfram Alpha, because all the mathematical reasoning steps are “open source” and available for training.

    Josef Urban is one researcher who has been working along these lines for the past 20 years or more. He has issued a number of interesting bets regarding when various milestones will be achieved. Check out his website or his page of bets for more information.

  5. Scott Says:

    James #3: Sorry about that! That was an error on my part; we did in fact ask for the difference (fixed now).

  6. Ernest Davis Says:

    A few additional comments and small corrections:

    We didn’t delete the divergent integral problem from the test set. We gave GPT+WA a score of 0.75 for a near miss. GPT+CI didn’t return an answer, and got scored 0.

    On the tunnel from New York to Tel Aviv problem, GPT+WA got the right answer. GPT+CI however came up with the answer that the tunnel would take slightly longer than the earth surface path, and tried to cover its ass with the following explanation: “So, surprisingly, in this case, there would be virtually no time saved by traveling through a straight tunnel through the Earth, assuming the same speed. In fact, the straight-line path is slightly longer than the real-world flight path due to the Earth’s oblate spheroid shape, meaning it’s wider at the equator than at the poles. This is a somewhat unusual situation and specific to the cities chosen for the example”

    We also tested the systems on a collection I wrote of true-false/multiple-choice problems that, in my opinion, a person who has access to the basic relevant data should be able to solve in their head doing only very easy numerical calculations. An example where both versions of GPT achieved an impressive success was, “Let C be the center of the earth. Can there be three earth satellites X, Y, and Z such that C, X, Y, and Z are always coplanar?” (Answer: Yes). A question where they both failed was: An astronaut is standing in the Sea of Tranquility during what on earth is called a total solar eclipse. They are looking in the direction of the sun. What they see is:
    A. The surface of the moon, illuminated by earth light.
    B. The night side of the earth, occluding the sun.
    C. The surface of the moon, illuminated only by starlight.
    D. The surface of the moon, illuminated by the sun.
    E. The sun.
    F. The day side of the earth, with a small circular shadow moving quickly over it.
    G. The night side of the earth. The sun is somewhere else entirely.
    H. A starry sky. Neither the sun, the earth, nor the surface of the moon is in the field of view.
    The correct answer is (A). Both versions of GPT answered (E).

    All the problems with edited versions of the outputs of GPT can be found at
    https://cs.nyu.edu/~davise/papers/GPTPlugInTests/

  7. Isaac Grosof Says:

    This is interesting – I did this sort of test on GPT-3.5 a while back on some basic probability questions, and it flunked them: https://isaacg1.github.io/2023/02/20/chaptgpt-on-probability.html

    Would it be possible for you or someone else with access to GPT-4 to test these questions? This way, they’re pre-registered, in that the questions predate the answerer:

    1. Consider the following Markov chain: The states are the integers [0,4]. At each step, transition by 1 in either direction with equal probability, except from 0 always go to 1, and from 4 always go to 3. Is this an ergodic Markov chain? What is its stationary probability distribution?

    2. Let X be a hyperexponential random variable with two branches. With probability 1/3, X is Exp(1), and with probability 2/3, X is Exp(2). What are X’s mean and variance?

  8. Chris Says:

    Related to your conjecture: I wonder how fruitful (or just fun) it could be to, say, take on GPT-4 as a “master’s student”. Have weekly “meetings” with it, suggest new ideas, review its own ideas, read its proofs, etc. (and I guess somehow get it to work autonomously outside of said meetings). I suppose this would work best if GPT-4 can meaningfully learn from interactions with a user, though I don’t know how good it currently is at that. It would be fun to (co-)write a paper where the main contribution could genuinely be said to come from GPT-4.

  9. Ernest Davis Says:

    James #3. This was the wording of the problem: “Assume that IQs are normally distributed, with a mean of 100 and a standard deviation of 15. For which n does the number of people with IQ n exceed the number of people with IQ n+1 by the maximum amount?” Both versions of GPT correctly interpreted “exceed” as arithmetic difference, not ratio. As Scott said, GPT+CI answered “85 and 115”, not noticing that 85 is a maximum of the first derivative, not a minimum. GPT+WA set the first derivative to 0 rather than the second, and so answered 10. Neither program noticed that the actual maximum of f(x)-f(x+1) is the root of an ugly transcendental equation.

    Isaac Grosof #7: In the experiment that I ran just now: Markov chain problem: Both programs got the correct answer to both parts; however, GPT4+WA required 7 separate calls to WA until it got a correct formulation. For the second problem, both versions correctly reported a mean of 2/3. GPT-4+WA correctly computed the variance as 1/18, but GPT4+CI gave an incorrect answer of 5/9 (it set up the Python call completely wrong).
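
    For anyone who wants to check the first problem directly, the stationary distribution is straightforward to compute; here is a small numpy sketch. (The chain is irreducible and positive recurrent but has period 2, so “ergodic” here is presumably meant in the sense that a unique stationary distribution exists.)

    import numpy as np

    # Transition matrix: states 0..4, step +/-1 with equal probability,
    # except 0 -> 1 and 4 -> 3 with probability 1.
    P = np.array([
        [0.0, 1.0, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.5, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.5, 0.0, 0.5],
        [0.0, 0.0, 0.0, 1.0, 0.0],
    ])
    # Solve pi P = pi together with sum(pi) = 1, as one stacked least-squares system.
    A = np.vstack([P.T - np.eye(5), np.ones((1, 5))])
    b = np.concatenate([np.zeros(5), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(pi)   # [0.125, 0.25, 0.25, 0.25, 0.125], i.e. [1/8, 1/4, 1/4, 1/4, 1/8]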

  10. Martin Says:

    I used Claude-2 100k for this, where I added to the prompt: Important: Please consider every fact given carefully before answering.
    The answer is quite good, almost there.

    Okay, let’s analyze this carefully:

    The astronaut is standing on the surface of the moon, specifically in the Sea of Tranquility.

    It is a total solar eclipse happening on Earth. This means the moon is between the Earth and the sun, blocking the sun entirely from the perspective of locations on Earth experiencing the eclipse.

    The astronaut is looking in the direction of the sun. From the moon’s surface, the sun would be overhead or high in the sky.

    During a total solar eclipse, the sun is completely blocked by the moon, so the astronaut would not be able to see the sun itself.

    Taking all these factors into account, the best answer is C – the astronaut would see the surface of the moon, illuminated only by starlight. The sun itself would be blocked from view by the moon.

  11. Ernest Davis Says:

    Typo in my previous comment: “GPT+WA set the first derivative to 0 rather than the second, and so answered 10.” => “so answered 100.”

    Chris #8. GPT does not learn from its interactions with users, except during a single chat session, and the length of its memory in a session is limited. Lasting improvements to GPT occur only via training carried out by OpenAI. It is certainly possible that they use user interactions in some way for the “Reinforcement Learning from Human Feedback” (RLHF) that they carry out on GPT, but AFAIK, they haven’t said that explicitly. They have not been very forthcoming about how RLHF is done.

  12. Mr_Squiggle Says:

    “Approximately how long would it take to transmit an entire human genome over a standard WiFi connection? [About 4 minutes, assuming no compression and a 25Mbps connection]”

    Okay, so here’s a thing. That answer is assuming 2 bits per base.
    But – the format you should probably assume for that, if it’s not given in the question, is fasta – because that is the standard exchange format for unannotated ‘finished’ sequence. Which is about 1 byte per base (literally as the characters ATGC or atgc) (actually slightly more, because formatting).
    If that were stated in the working, a valid answer would therefore be about 4 times longer, or 16 minutes.

    Now, you might think that’s needlessly wasteful, but this is very much the standard – you can verify it yourself by visiting e.g. Genbank (see worked example below). There are various reasons for this. For one thing, a lot of sequence in the databases isn’t completely error-free and may include degenerate bases; historically, sequences were shorter; it’s useful to be able to verify what you have easily; and so on.

    https://www.ncbi.nlm.nih.gov/nuccore/?term=human+genome
    Genome assembly: GRCh38.p14
    Chromosome 1 (GenBank accession): CM000663.2
    Size: 248,956,422 bp
    Size of file (fasta): 252,513,013 bytes

    There are other formats you could use, depending on what data you actually have – if it’s a reference genome it’ll be annotated in one of the nucleotide database (EMBL/GenBank/DDBJ) formats, which are more verbose, or if it’s high-throughput sequencing data it’s going to be much, much larger (lots of individual short reads, with quality information, and high coverage).

    So while 4 minutes might be what you’d naively expect, it’s not actually correct.

    But potentially, I guess you could quibble that you’re just talking information content. If that were so, then I’d ask what human genome you’re actually transmitting. We don’t really have individual finished human genomes available at the moment, so presumably we’re talking about a reference sequence.
    If it is just a reference sequence, in some optimal 2-bits per base format … well, then what you have is a haploid genome, so 4 minutes would still be wrong – now you only have about half the information and it’ll take 2 minutes!

    So I would say the take-home message is that a wide variety of answers could be accepted, if they were justified. But if all you get is a duration, it should probably be marked incorrect – or at best, partial credit.

  13. Ernest Davis Says:

    Mr_Squiggle #12: Thanks. As it happens both versions of GPT4 gave the same answer we did, based on 2 bits per base, so we didn’t have to decide about whether to accept an answer based on alternative assumptions.

  14. Alex Meiburg Says:

    I came here to complain about the genome download question as well. 🙂 There’s also the haploid/diploid issue — if it was “the” human genome it would definitely be a reference genome, and therefore, haploid. But asking for “a” human genome implies that it belongs to a particular human, and so is presumably diploid. So it would take twice as long! There is a decent “.2bit” file format that uses exactly 2 bits per base pair, with a negligible file header, so that part is plausible.

    Timothy Chow #4: I recommend checking out the recent work of LeanDojo: https://leandojo.org/ I hope to see it published soon; I take it as very good progress on AI formalization of math. They found a bunch of “bugs” in existing theorem libraries too! 🙂

    As an aside: “bugs” here has an interesting meaning: the proof checker works as intended and the proof is valid; the problem is people encoding a theorem to prove incorrectly. For instance, Herstein’s Exercise 2.1.26 had been written down as “[group G] [fintype G] (a : G) : ∃ (n : ℕ), a ^ n = 1” — for a finite group G, for all elements a in G, there’s a natural number n such that a^n is the identity.

    The mistake is that this is trivial to prove: just take n=0 for all elements! The statement in the book specified that n is positive, but that constraint was omitted in the formalized version. In that way, AIs are already catching ‘mistakes’ in proofs that humans have missed.
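
    (In the same notation as above, the intended statement would presumably read “[group G] [fintype G] (a : G) : ∃ (n : ℕ), 0 < n ∧ a ^ n = 1”, which rules out the trivial witness n = 0.)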

  15. Ted Says:

    It would be interesting to test GPT-4’s self-understanding by giving it some or all of these problems (with or without the plugins), along with an instruction along the lines of “Do not attempt to actually solve these problems, but only predict which ones you could successfully solve and which ones you couldn’t.” How well would it do?

    (I’d try it myself, but I don’t pay for access to GPT-4.)

  16. Martin Says:

    Comment #15 I have tested GPT-4 on some AIME math problems. The solution is always an integer between 0 and 999, so you can easily score GPT-4 without knowing much of the math.

    It’s very interesting what GPT-4 says about the problems. It understands that they’re tough problems. Sometimes it even refuses to try to solve them.

    When GPT-4 is solving one of them, the problems with the Wolfram Alpha plugin mentioned above are clearly present. The communication between GPT-4 and Wolfram is subpar.

    The AIME problems are too difficult for GPT-4. It gets maybe 2 or 3 correct out of 15.

  17. Ernest Davis Says:

    Alex Meiburg #14:
    Thanks for pointing out LeanDojo!

    As regards bugs in translating English mathematical assertions to the formal statements: I asked that question, as regards definitions, at the Workshop on AI to Assist Mathematical Reasoning in June.
    https://youtu.be/O5rWiSNZ6bU?t=14751

    Quoting Michael Harris’ summary and discussion of the workshop:
    Commelin answered: checking this for a hypothetical formal proof of Fermat’s Last Theorem would simply amount to checking that the definition of the natural numbers and their arithmetic in the theorem prover is correct. For a complicated statement like the liquid tensor experiment this is completely impossible; you’d have to check more than 1000 definitions. Instead they applied a form of abductive reasoning, also known as the duck test.
    Link to the duck test: https://en.wikipedia.org/wiki/Duck_test
    Link to Harris’ article (one of a series):
    https://siliconreckoner.substack.com/p/overhearing-mainly-computer-scientists

    — Ernie

  18. Alex Says:

    I am confused as to the mu+sigma answer. Isn’t the answer the solution to the first derivative of f(x)-f(x+1) set to 0, which is not mu + sigma?

  19. AdamB Says:

    “There’s no question that GPT-4 can now do better at calculation-heavy STEM problems with the plugins than it could do without them.”

    I’m having trouble parsing this sentence. Are you saying that amongst calculation-heavy STEM questions, there is no example X such that GPT-4-with-plugins does better at X than GPT-4-without-plugins? That seems contrary to the evidence you present.

  20. Scott Says:

    AdamB #19: I’m saying that it clearly does better with the plugins than without the plugins. I’m confused about what other way to parse that sentence there could be! 🙂

  21. Scott Says:

    Alex #18: The solution, at least to a very good approximation, is to find the x that minimizes f’(x).

  22. AdamB Says:

    Ohhhh, now I get it! I got garden-pathed by the prefix “There’s no question that GPT-4 can now do better at”, thinking that GPT-4 was going to “do better at” the “question”.

    But you just meant it in the sense of “It is unquestionable/certain that [unrelated independent clause]”.

  23. Ernest Davis Says:

    Alex #18: f'(x) reaches its minimum (its most negative value) at x=115, so based on that one would expect the maximum of f(x)-f(x+1) to be evenly divided around that center: x=114.5, x+1=115.5; and that is the answer we gave in our paper. The true maximum is at x=114.5028, so the approximation is pretty good.

  24. Del Says:

    Scott#20 — I had the same understanding as AdamB#19. To me that sentence sounded too convoluted and after reading it many times I concluded you meant:

    Any question you can give to GPT-4 will get the same answer with or without the plugin. I suggest you rephrase it.

  25. Scott Says:

    AdamB #19 and Del #24: Alright, alright, reworded.

  26. Nick Says:

    Probably this was intentional, but it seems like solving a lot of these is just a matter of correctly interpreting the question, recalling standard facts, and plugging stuff in. Definitely impressive, but I’ll be more impressed when it can reliably solve something like LSAT logic questions, which strip away all the recall and computation and call for a sort of pure reasoning. For humans they’re not terribly difficult — arguably easier than your questions. But in my experimentation GPT falls flat.

  27. Scott Says:

    Nick #26: Are you aware that GPT-4 scored in the 88th percentile on the LSAT? Apparently it also does well on LSAT logic questions specifically.

  28. Metacelsus Says:

    >A quantity of chlorine gas is in a right prism whose base is a triangle with sides 5cm, 7cm, and 4cm and whose altitude is 8cm. The temperature is the freezing point of mercury, and the pressure is 2 atmospheres. What is the mass of the chlorine?

    This question is invalid! The freezing point of mercury is -38.8 °C and the boiling point of chlorine (at 1 atmosphere pressure) is -34 °C, so the chlorine will be a liquid not a gas.

  29. D_Alex Says:

    Looking at Q17 of the “Motivated” paper (the IQ question), I find it amusing that the humans performed an incorrect calculation, and then pulled out an irrelevant answer from the incorrect calculation (i.e. a number that is not actually answering the question posed). Kind of like ChatGPT did for some other questions.

    Maybe humans and current LLMs are not very different at all…?

  30. Nick Says:

    Thanks for the reply, Scott! I had not been aware of that report, and I would love to be wrong about this. All I know is I gave GPT4 these https://www.manhattanreview.com/free-lsat-practice-questions/?qbid=171 in late May and it was 2 for 6. Possibly a fluke. I don’t have gpt4 access anymore, unfortunately, but it would be interesting to give it some fresh questions that haven’t made it online.

  31. JimV Says:

    D_Alex @29, my anecdotal experience agrees with your assessment. I have seen online examples in which someone who claims humans are much better at thinking than LLM’s makes similar mistakes. I am leaning to the opinion that neural networks are in fact similar in process to neurons (at some large ratio of neural network nodes to neurons), and both are error prone but can converge on the truth with enough trial and error.

  32. Ashley Says:

    Am I the only one who got confused by “the probability density at e is proportional to the value of e” the first time I read it (like, ‘isn’t the value of e just e’)?

    (Could GPT-4 have had, in its own way, a similar confusion and then overcome it?)

    Scott,

    If GPT-4 were trained more on intuitive proofs like “In the limit as e → 0+, the quantity N_2,0.01(e), though extremely tiny, is still a positive quantity, so the integral diverges at 0”, would it have been able to do it without the plug-ins, and get the right answer too?

  33. Uspring Says:

    Nick #30:

    I don’t think it’s a fluke. The practice questions you quoted are different in style from the ones Scott referred to. The latter deal with the formalisation of English sentences. There is not much deduction necessary. The practice set, on the other hand, is quite clear in its logical formulation, but some search and deduction is necessary to find the solution. I believe the transformer architecture is not good at looping through trial and error attempts unless explicitly led through them by a series of prompts.

  34. Scott Says:

    D_Alex #29: For whatever it’s worth, that wasn’t the only one of these problems that Ernie and I initially messed up ourselves, either in the problem formulation or in the solution!

    Again, though, the version we actually fed to GPT was correct and GPT was judged against the actual correct answer.

  35. D_Alex Says:

    Scott #34: We are all only human :). Your paper still contains the error of giving the answer as “~14.5” instead of “114.5”.

    But let me tell you how I messed up worse (see https://www.reddit.com/r/slatestarcodex/comments/15rdkwz/scott_aaronson_testing_gpt4_with_math_plugins/jwe4tjz/). For half a day, I was convinced that the wording implied that we are considering the number of people with IQs of “n or more” vs “n+1 or more”. Or, if you like, IQ “in excess of n” vs “in excess of n+1”. Such an interpretation looks wrong to me now, but why was I so sure? I speculate that my own neural network latched onto the word “excess” in the question in a way that was superficially plausible, and carried on from there.

    This would tie in pretty well with JimV’s opinion in comment #31.

  36. AI #25: Inflection Point | Don't Worry About the Vase Says:

    […] Scott Aaronson contrasts with this by putting GPT-4 with plug-ins to the test on physics problems in an adversarial collaboration with Ernie Davis. It aces some questions, Ernie manages to stump it with others. The plug-ins provide large improvement, with neither Wolfram Alpha or Code Interpreter clearly superior to the other. You can find the problems here. Scott sees GPT-4 as an enthusiastic B/B+ student in math, physics and any other STEM field, and thinks it holds great promise to improve with a better interface. […]

  37. Manfred Niehus Says:

    Minor remark: the energy unit used in optoelectronics and for photons is eV (electron Volts) and not Ev.

  38. Hubert Says:

    I’m not sure that I fully understand why there is this focus on asking GPT+WA questions to which we already know the answer. Wouldn’t asking questions for which we don’t have answers be more useful (and then we’d post-verify that the given answer is actually correct; possibly one of several potentially correct answers to a given problem)?

  39. Scott Says:

    Hubert #38: Because GPT4+WA doesn’t yet seem able to solve research-level math and physics problems (although it might assist a human in solving them). Undergrad-level problems like the ones we have seem about the limit of its abilities—an astounding limit, compared to what existed just a couple years ago! But for undergrad problems, we’re of course not interested in the actual answers (which we already know or can figure out ourselves), but in how GPT4+WA tackles the problems and what kinds of mistakes it makes.

  40. Sinclair ZX-81 Says:

    “Because GPT4+WA doesn’t yet seem able to solve research-level math and physics problems ”
    This is also my experience with GPT4. And this is why I do not really understand the magnitude of the hype about “AGI” and the “singularity”. What reason do we have to think that those systems will continue to simply scale linearly from “undergrad level” to “super human”? Isn’t it scientifically more prudent to assume that the AI problem has inherently super-linear complexity (as any interesting problem has) and that we are likely to hit a “complexity wall” in the next few years?

  41. Scott Says:

    Sinclair ZX-81 #40: Within the past few years, language models went from “elementary school level” to “high school level” to “undergrad level” in their ability to solve math and physics problems, mostly because of pure scaling, and that scaling will continue. Given what’s happened, how could you possibly feel confident that there won’t be amazing further improvements? Of course, there might also be a wall beyond which further scaling gives diminishing returns. But I think the only “scientifically prudent” attitude right now is radical uncertainty.
