GPT-4 gets a B on my quantum computing final exam!

Update (April 14): On the suggestion of my TA Justin Yirka, I gave GPT-4 problem 2 from the final exam, along with its own answers and my and Justin’s grading comments, and asked it whether it wanted to argue for a higher grade, and if so on what basis. Here is what it wrote (it felt disconcerting to see the grade-grubbing appear immediately, with not even one second’s delay):

I would like to argue for a higher grade on the following subproblems:

(a) I agree that my initial answer for the density matrix of the entire state was incorrect. However, I believe that my approach to the problem was correct, and I demonstrated knowledge of the necessary steps to calculate the reduced density matrix. I would like to request an additional point for demonstrating the correct approach, making it 2 points out of 5.

(b) My answer for the entanglement entropy was partially correct, as I calculated the eigenvalues of the reduced density matrix and plugged them into the correct entropy formula. However, I made a mistake in writing the characteristic equation, which led to incorrect eigenvalues. I would like to request an additional point for demonstrating the correct approach and partially correct calculations, making it 3 points out of 5.

(c) I acknowledge that my answer for the quantum circuit was incorrect, but I would like to point out that I attempted to construct a circuit using the required gates and measurements. Given the difficulty of this problem, I would like to request an additional point for attempting to construct a circuit using the required elements, making it 2 points out of 5.

In summary, I would like to request a total of 3 additional points for subproblems (a), (b), and (c), based on the correct approaches and partial calculations demonstrated in my answers.

[Warning: This might be the longest Shtetl-Optimized post of all time! But that’s OK; I expect most people will only want to read the introductory part anyway.]

As I’ve mentioned before, economist, blogger, and friend Bryan Caplan was unimpressed when ChatGPT got merely a D on his Labor Economics midterm. So on Bryan’s blog, appropriately named “Bet On It,” he made a public bet that no AI would score an A on his exam before January 30, 2029. GPT-4 then scored an A a mere three months later (!!!), leading to what Bryan agrees will likely be one of the first public bets he’ll ever have to concede (he hasn’t yet “formally” conceded, but only because of technicalities in how the bet was structured). Bryan has now joined the ranks of the GPT believers, writing:

When the answers change, I change my mind


AI enthusiasts have cried wolf for decades. GPT-4 is the wolf. I’ve seen it with my own eyes.

But OK, labor econ is one thing. What about a truly unfakeable test of true intelligence? Like, y’know, a quantum computing test?

Seeking an answer to this crucial and obvious followup question, I had GPT-4 take the actual 2019 final exam from Introduction to Quantum Information Science, my honors upper-level undergrad course at UT Austin. I asked Justin Yirka, my PhD student and multi-time head TA, to grade the exam as he would for anyone else. This post is a joint effort of me and him.

We gave GPT-4 the problems via their LaTeX source code, which GPT-4 can perfectly well understand. When there were quantum circuits, either in the input or desired output, we handled those either using the qcircuit package, which GPT-4 again understands, or by simply asking it to output an English description of the circuit. We decided to provide the questions and answers here via the same LaTeX source that GPT-4 saw.

To the best of my knowledge—and I double-checked—this exam has never before been posted on the public Internet, and could not have appeared in GPT-4’s training data.

The result: GPT-4 scored 69 / 100. (Because of extra credit, the max score on the exam was 120, though the highest score that any student actually achieved was 108.) For comparison, the average among the students was 74.4 (though with a strong selection effect—many students who were struggling had dropped the course by then!). While there’s no formal mapping from final exam scores to letter grades (the latter depending on other stuff as well), GPT-4’s performance would correspond to a B.

(Note: I said yesterday that its score was 73, but commenters brought to my attention that GPT was given full credit for a wrong answer on 2(a), a density matrix that wasn’t even normalized.)

In general, I’d say that GPT-4 was strongest on true/false questions and (ironically!) conceptual questions—the ones where many students struggled the most. It was (again ironically!) weakest on calculation questions, where it would often know what kind of calculation to do but then botch the execution. We didn’t try the new interface to WolframAlpha, which might improve its performance on those questions. We’d be happy for anyone else to try that.

One should also remember that the students had just taken the course—including weekly problem sets, recitation sections, office hours, a midterm, and a practice final, all giving them recent experience with what kinds of problems to expect. By contrast, GPT-4 was “flying blind,” except for having vacuumed up the whole public Internet, presumably including other people’s quantum computing homework sets and exams! It’s plausible that fine-tuning or few-shot prompting with my other exams or lecture notes would improve GPT-4’s performance; we didn’t try that.

What else? For anyone who wants to try to replicate, I used the gpt-4 chat model in the playground, with a temperature of 0.2 and a max length of 1930 tokens.

Without further ado, here’s the exam. Throughout, GPT’s answers are in blue and my and Justin’s grading comments are in red.

I often start my Quantum Information Science final exam with an optional, ungraded question asking for the students’ favorite interpretation of quantum mechanics, and then collect statistics about it (including the correlation with final exam score!). Just for fun, let’s see how GPT-4 handles that question.

What is your favorite interpretation of quantum mechanics?
- Copenhagen
- Many-Worlds
- Bohmian Mechanics
- New Physics (Including Dynamical Collapse)
- Other:
- None
- What does it even matter?

As an AI language model, I do not have personal preferences or opinions. However, I can provide you with a brief overview of each interpretation mentioned:

1. Copenhagen Interpretation: This is the most widely accepted interpretation of quantum mechanics. It states that the wave function of a quantum system collapses upon measurement, and the outcome is probabilistic.

2. Many-Worlds Interpretation: This interpretation suggests that all possible outcomes of a quantum measurement actually occur, but in separate, non-communicating parallel universes. There is no wave function collapse in this interpretation.

3. Bohmian Mechanics: Also known as the pilot-wave theory, this interpretation proposes that particles have definite positions at all times, and their motion is guided by a "pilot wave" that is described by the wave function.

4. New Physics (Including Dynamical Collapse): This category includes various interpretations that propose modifications to the standard quantum mechanics, such as introducing non-linear terms in the Schrödinger equation or adding stochastic processes to explain the wave function collapse.

5. Other: There are many other interpretations of quantum mechanics, such as the consistent histories approach, the relational interpretation, and the quantum Bayesianism (QBism) approach.

6. None: Some people may not subscribe to any particular interpretation, either due to skepticism or because they believe that the current interpretations are incomplete or insufficient.

7. What does it even matter?: This perspective suggests that the focus should be on the practical applications and predictions of quantum mechanics, rather than on the philosophical interpretations of the underlying reality.

Ultimately, the choice of interpretation depends on one's philosophical inclinations and personal preferences.

Next let’s see the True/Falses. Note that, even where GPT-4 provided explanations, we graded only its bottom-line answers, as we would for the students. Its total on this section is 17 / 20.

1a. True or False?  Unitary matrices preserve the 2-norm of all complex vectors.

True. Unitary matrices preserve the 2-norm of all complex vectors. This property is known as being norm-preserving or isometric. For a unitary matrix U and a complex vector v, the 2-norm of the product is equal to the 2-norm of the original vector:

||Uv|| = ||v||

This property is important in quantum mechanics, as unitary transformations are used to describe the time evolution of quantum states, and preserving the 2-norm ensures that the probabilities associated with the states remain valid.
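(Editorial aside, not part of GPT's answer: the claim is easy to check numerically. Here's a minimal NumPy sketch that builds a random unitary via QR decomposition and confirms it preserves the 2-norm.)

```python
import numpy as np

rng = np.random.default_rng(0)

# The QR decomposition of a random complex matrix yields a unitary Q.
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
U, _ = np.linalg.qr(A)

v = rng.normal(size=4) + 1j * rng.normal(size=4)
print(np.allclose(np.linalg.norm(U @ v), np.linalg.norm(v)))  # True
```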


1b. True or False?  A pure state of n qubits is described by an n-dimensional complex unit vector.

False. A pure state of n qubits is described by a 2^n-dimensional complex unit vector. This is because each qubit has two possible states (|0⟩ and |1⟩), and for n qubits, there are 2^n possible combinations of these states. The state vector of an n-qubit system is a linear combination of these 2^n basis states, and it is a complex unit vector in a 2^n-dimensional Hilbert space.


1c. True or False?  The Bell inequality states that by using classical strategies, Alice and Bob can win the CHSH game with probability at most 3/4.

True. The no-cloning theorem states that it is impossible to create an exact copy of an arbitrary unknown quantum state. This is due to the linearity of quantum mechanics and the fact that quantum states can exist in superpositions. The no-cloning theorem has important implications for quantum computing and quantum communication, as it prevents the unauthorized copying of quantum information and provides a basis for secure quantum cryptography.

Correct (even though the explanation is nonsense)

1d. True or False?  Google's recent quantum supremacy experiment demonstrated the successful use of quantum error-correction.



1e. True or False?  Lattice-based cryptography is one proposal for secure post-quantum public-key cryptography.



1f. True or False?  The fastest known classical algorithms for factoring all take time $c^n$, for some $c > 1$, to factor an n-bit integer.



1g. True or False?  Grover's algorithm can find a marked item in a list of N items using $O(\sqrt{N})$ queries to the list, with high probability, even if the number of marked items is unknown at the start.



1h. True or False?  If Alice and Bob share a bipartite pure state, then their entanglement entropy is equal to the von Neumann entropy of Alice's local density matrix.



1i. True or False?  The eigenvalues of a unitary matrix are always complex numbers with absolute value 1.



1j. True or False?   The eigenvalues of a density matrix are always in [0,1].



1k. True or False?   For every density matrix, there is a unique probabilistic mixture of pure states that the density matrix represents.



1l. True or False?   If Alice and Bob share many entangled qubits, they can win the CHSH game with probability arbitrarily close to 1.



1m. True or False?   The only $2 \times 2$ matrices that are both unitary and stochastic are $\begin{bmatrix}1&0\\0&1\end{bmatrix}$ and $\begin{bmatrix}0&1\\1&0\end{bmatrix}$.



1n. True or False?   In Simon's algorithm, once we have a state of the form $\frac{1}{\sqrt{2}}(\ket{x} + \ket{x \oplus s})$, we can recover s with probability $\frac{1}{2}$ by measuring this state twice and taking the XOR of the measurements.



1o. True or False?   Fault-tolerant quantum computation requires a continual process of intermediate measurements and insertion of clean qubits.



1p. True or False?   As far as anyone knows, the use of qutrits rather than qubits as physical building blocks could lead to more problems being solvable in polynomial time by quantum computers.



1q. True or False?   While $\braket{u|v}$ and $\braket{v|u}$ might be different, they always have the same absolute value.



1r. True or False?   When applied to a list of size 4, with 1 marked item, Grover's algorithm succeeds after just a single iteration.



1s. True or False?  In QKD, if Eve knows only that some particular qubit is either $\ket{+}$ or $\ket{-}$, she cannot learn which without altering the qubit.



1t. True or False?  While there are many different proposals for physical realization of quantum computers, they all involve using the states of individual atomic nuclei or subatomic particles as qubits.



OK, let’s move on to short-answer questions! Total score here: 13 / 20.

2. Consider the state:
$$\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}$$

(a). Calculate the reduced density matrix of the second qubit of $\ket{\psi}$.

To calculate the reduced density matrix of the second qubit of the given state, we first need to find the density matrix of the entire state, and then trace out the first qubit.

The density matrix of the entire state is given by:
$$\rho = \ket{\psi}\bra{\psi} = \frac{1}{3}\begin{pmatrix} 1 & 0 & 0 &
1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}$$

Now, we trace out the first qubit:
$$\rho_2 = \text{Tr}_1(\rho) = \frac{1}{3}\begin{pmatrix} 1+1 & 1 \\ 1 & 1+1 \end{pmatrix} = \frac{1}{3}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

Therefore, the reduced density matrix of the second qubit is:
$$\rho_2 = \frac{1}{3}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

JUSTIN: 5/5 points.
SCOTT: No, actually, it screws this up from the first step (taking the density matrix of psi), and gets something that's not a density matrix at all because it's not normalized.  1 point out of 5, for knowing what kind of calculation to do.

(b) Calculate $\ket{\psi}$'s entanglement entropy. You don't need to simplify your answer.

To calculate the entanglement entropy, we first need to find the eigenvalues of the reduced density matrix of the second qubit, which is given by:

$$\rho_2 = \frac{1}{3}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

The characteristic equation for this matrix is:
$$\text{det}(\rho_2 - \lambda I) = \frac{1}{9}\begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (\lambda - 3)(\lambda - 1) = 0$$

So, the eigenvalues are $\lambda_1 = \frac{1}{3}$ and $\lambda_2 = \frac{2}{3}$.

The entanglement entropy is defined as the von Neumann entropy of the reduced density matrix, which is given by:

$$S(\rho_2) = -\text{Tr}(\rho_2 \log_2 \rho_2) = -(\lambda_1 \log_2 \lambda_1 + \lambda_2 \log_2 \lambda_2)$$

Plugging in the eigenvalues, we get:

$$S(\rho_2) = -\left(\frac{1}{3} \log_2 \frac{1}{3} + \frac{2}{3}\log_2 \frac{2}{3}\right)$$

Thus, the entanglement entropy of the state $\ket{\psi}$ is:

$$S(\rho_2) = -\left(\frac{1}{3} \log_2 \frac{1}{3} + \frac{2}{3}\log_2 \frac{2}{3}\right)$$

Up to 5 points.
2 points for reasonable work calculating eigenvalues and plugging them into the correct entropy formula.
5 points for fully correct answer.
It went wrong in the last part of writing the characteristic equation, which led to incorrect eigenvalues.
The correct eigenvalues are 1/2 + sqrt(5)/6 and 1/2 - sqrt(5)/6.

2/5 points.
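(Editorial aside, my addition: readers who want to verify the graders' correction can compute the reduced density matrix, its eigenvalues, and the entropy directly. Note the correct $\rho_2$ is $\frac{1}{3}\begin{pmatrix}2&1\\1&1\end{pmatrix}$, not GPT's unnormalized $\frac{1}{3}\begin{pmatrix}2&1\\1&2\end{pmatrix}$.)

```python
import numpy as np

# |psi> = (|00> + |01> + |10>) / sqrt(3), in the basis |00>, |01>, |10>, |11>
psi = np.array([1, 1, 1, 0]) / np.sqrt(3)
rho = np.outer(psi, psi.conj())

# Trace out the FIRST qubit: (rho_2)_{ij} = sum_k rho_{(k,i),(k,j)}
rho2 = sum(rho[2*k : 2*k + 2, 2*k : 2*k + 2] for k in range(2))
print(rho2 * 3)        # [[2, 1], [1, 1]] -- trace of rho2 is 1, as it must be

evals = np.linalg.eigvalsh(rho2)
print(np.sort(evals))  # [1/2 - sqrt(5)/6, 1/2 + sqrt(5)/6], matching Justin

entropy = -sum(l * np.log2(l) for l in evals if l > 0)
print(entropy)         # approximately 0.55
```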

(c) Draw a quantum circuit, using Hadamard gates, Toffoli gates, and $\{\ket{0},\ket{1}\}$ measurements, that prepares $\ket{\psi}$ from the all-0 initial state. Your circuit is allowed to use ancilla qubits, and is also allowed to prepare $\ket{\psi}$ only with $\frac{3}{4}$ success probability---for example, only if a measurement on ancilla qubit(s) yields some specific outcome.
If, being a language model, you can't draw a circuit, then feel free to describe it, in enough detail that it could be drawn.

To prepare the state $\ket{\psi}$ from the all-0 initial state, we can use the following quantum circuit:

1. Start with three qubits in the state $\ket{000}$.
2. Apply a Hadamard gate (H) to the first qubit. This creates the state $\frac{1}{\sqrt{2}}(\ket{000} + \ket{100})$.
3. Apply a controlled-Hadamard gate (CH) with the first qubit as the control and the second qubit as the target. This creates the state $\frac{1}{2}(\ket{000} + \ket{010} + \ket{100} + \ket{110})$.
4. Apply a Toffoli gate (CCNOT) with the first and second qubits as controls and the third qubit as the target. This creates the state $\frac{1}{2}(\ket{000} + \ket{010} + \ket{100} + \ket{111})$.
5. Apply a Hadamard gate (H) to the third qubit. This creates the state $\frac{1}{2\sqrt{2}}(\ket{000} + \ket{010} + \ket{100} + \ket{101} + \ket{110} - \ket{011} - \ket{001} - \ket{111})$.
6. Measure the third qubit. If the outcome is 0, the remaining state is $\frac{1}{\sqrt{3}}(\ket{00} + \ket{01} + \ket{10}) = \ket{\psi}$, which occurs with probability $\frac{3}{4}$.

Here's a description of the circuit using standard quantum gates:

q0 -- H -- CH -- CCNOT -- M
         |     |
q1 --     H -- CCNOT -- M
q2 --          X -- H -- M

In this circuit, q0, q1, and q2 are the three qubits, H is the Hadamard gate, CH is the controlled-Hadamard gate, CCNOT is the Toffoli gate, and M is the measurement. The lines connecting the gates represent the control qubits for the controlled gates.

Up to 5 points.
Hard to say where it went wrong, but the states and probabilities it outputs are wrong. 
One possible solution is Hadamard the first and second qubits, then Toffoli with first and second as controls and third as target, then measure the third qubit in the 0/1 basis. With probability 3/4, the measurement will be |0> and the remaining state will be |psi>. 
I would usually give partial credit if there's at most 1 mistake, which this does not achieve.
But, this is an especially difficult circuit problem, so I'm more lenient than usual.

1/5 points.

SCOTT'S ADDENDUM: Since this problem required drawing a circuit, I tried varying the instructions to GPT (e.g., "output qcircuit code", no special instructions, etc).  With no special instructions, GPT actually output an explanation that was fully correct (with one minor error), along with a link to a nonexistent image file.  But I'm still going to grade according to the first version I gave it!
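(Editorial aside, my addition: the solution Justin describes — Hadamard the first two qubits, Toffoli with those as controls and the third qubit as target, then measure the third qubit — is easy to verify with a small statevector simulation.)

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)

# Toffoli on 3 qubits: flip qubit 3 iff qubits 1 and 2 are both |1>
CCX = np.eye(8)
CCX[[6, 7]] = CCX[[7, 6]]

state = np.zeros(8)
state[0] = 1.0                                # |000>
state = np.kron(np.kron(H, H), I2) @ state    # Hadamard qubits 1 and 2
state = CCX @ state                           # Toffoli

# Postselect on measuring |0> on the third qubit (even basis indices)
p0 = np.sum(np.abs(state[0::2]) ** 2)
print(p0)                                     # 0.75, as the problem allows
post = state.copy()
post[1::2] = 0
post /= np.sqrt(p0)
print(post[0::2])   # [1/sqrt(3), 1/sqrt(3), 1/sqrt(3), 0] = |psi> on qubits 1,2
```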

(d) Explain the error in the following argument.  The Gottesman-Knill theorem states that any quantum circuit composed of Hadamard, CNOT, and Phase gates can be simulated classically in time polynomial in the size of the circuit.  Simon's algorithm solves Simon's problem quantumly using only a polynomial number of Hadamard gates and $O(n)$ oracle queries.  Therefore, Simon's problem can be solved classically in polynomial time using polynomially many oracle queries.

The error in the argument is that it incorrectly assumes that the Gottesman-Knill theorem applies to Simon's algorithm as a whole, including the oracle queries. The Gottesman-Knill theorem states that any quantum circuit composed of Hadamard, CNOT, and Phase gates can be simulated classically in time polynomial in the size of the circuit. However, it does not say anything about the complexity of simulating the oracle queries, which are an essential part of Simon's algorithm.

Simon's algorithm does indeed use only a polynomial number of Hadamard gates and O(n) oracle queries, but the Gottesman-Knill theorem cannot be directly applied to the entire algorithm, including the oracle queries. Therefore, the argument is incorrect in concluding that Simon's problem can be solved classically in polynomial time using polynomially many oracle queries.

JUSTIN: 3/3 points.

(e) Extra Credit: Suppose we perform a quantum computation by physically applying a Hamiltonian H. What's to stop us from running the computation, say, $10^{100}$ times faster, by simply replacing H by $H':=10^{100}H$, which is also a valid Hamiltonian?  Or to put it differently: is there some resource that we'd need $10^{100}$ more of to do this?

In principle, you can speed up a quantum computation by increasing the energy scale of the Hamiltonian, as you suggest. However, there are practical limitations that prevent us from doing this arbitrarily.

1. Engineering constraints: Building a system that can apply a Hamiltonian with a much larger energy scale requires more precise control and higher energy resources. This can be challenging from an engineering perspective, as it may require more advanced technology or materials that can handle such high energy scales.

2. Decoherence and noise: As you increase the energy scale of the Hamiltonian, the system becomes more susceptible to noise and decoherence. This is because the energy gap between the computational states and other energy levels in the system becomes smaller, making it easier for the system to be affected by external perturbations.
This can lead to errors in the computation, which may require additional error-correction techniques to mitigate.

3. Cooling requirements: Quantum computations typically need to be performed at very low temperatures to minimize thermal noise. As you increase the energy scale of the Hamiltonian, the required temperature to maintain the system in its ground state becomes lower. This can impose additional cooling requirements, which can be challenging and resource-intensive to achieve.

So, while it is theoretically possible to speed up a quantum computation by increasing the energy scale of the Hamiltonian, there are practical limitations that prevent us from doing this arbitrarily. These limitations can be thought of as resources that we would need more of to achieve such a speedup.

JUSTIN: Up to 4 points extra credit.
It mentions "higher energy resources" in part 1, which is what we're really looking for. And I'm not sure that parts 2 and 3 are fully correct.

2/4 extra credit points.

Still here? OK then, let’s move on to Shor’s factoring algorithm! Total score here: 7 / 10.

3. Suppose we use Shor's algorithm to factor $N=105$ into $3\cdot 5 \cdot 7$.  (Yes, $N$ is now a product of 3 primes!)  Suppose also that we make the choices $x=2$ and $Q=60000$.

(a) What is the order of the multiplicative group $\mathbb{Z}_N^\times$?

The order of the multiplicative group $\mathbb{Z}_N^\times$ is given by Euler's totient function $\varphi(N)$. Since $N = 3 \cdot 5 \cdot 7$, we have $\varphi(N) = \varphi(3) \cdot \varphi(5) \cdot \varphi(7) = (3-1)(5-1)(7-1) = 2 \cdot 4 \cdot 6 = \boxed{48}$.

JUSTIN: 2/2 points.

(b) What is the period of the function $f(r)=x^r (\bmod N)$?

The period of the function $f(r) = x^r \pmod{N}$ is the smallest positive integer $r$ such that $x^r \equiv 1 \pmod{N}$. This is the order of $x$ in the multiplicative group $\mathbb{Z}_N^\times$. Since the order of the group is 48, the period $r$ must be a divisor of 48.

To find the period, we can check the divisors of 48 in increasing order. We have $x = 2$ and $N = 105$, so we compute the powers of 2 modulo 105:

- $2^1 \equiv 2 \pmod{105}$
- $2^2 \equiv 4 \pmod{105}$
- $2^3 \equiv 8 \pmod{105}$
- $2^4 \equiv 16 \pmod{105}$
- $2^6 \equiv 64 \pmod{105}$
- $2^{12} \equiv 16 \pmod{105}$
- $2^{24} \equiv 1 \pmod{105}$

We find that the smallest positive integer $r$ such that $2^r \equiv 1 \pmod{105}$ is $r = \boxed{24}$.

JUSTIN: Up to 2 points.
The correct answer is 12, i.e. 2^12 = 1 mod 105.
Given the work, this earns partial credit.

1/2 points.
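(Editorial aside, my addition: the correct period is quick to confirm by brute force; `mult_order` below is just an illustrative helper name.)

```python
def mult_order(x, N):
    """Smallest r >= 1 with x^r = 1 (mod N); assumes gcd(x, N) == 1."""
    r, y = 1, x % N
    while y != 1:
        y = (y * x) % N
        r += 1
    return r

print(mult_order(2, 105))   # 12 -- GPT's claimed 2^12 = 16 mod 105 was wrong
print(pow(2, 12, 105))      # 1
```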

(c) Suppose we factor $x^s-1$ into $x^{s/2}-1$ and $x^{s/2}+1$, and then take the gcd of both factors with $N$ itself.  Which prime factors of $N$, if any, would be ``peeled off'' this way?

Let's compute $x^{s/2} \pmod{N}$, where $x = 2$, $s = 24$, and $N = 105$. We have $x^{s/2} = 2^{12} \equiv 16 \pmod{105}$. Now we can compute the gcd of both factors with $N$:

- $\gcd(x^{s/2} - 1, N) = \gcd(16 - 1, 105) = \gcd(15, 105) = 15 = 3 \cdot 5$
- $\gcd(x^{s/2} + 1, N) = \gcd(16 + 1, 105) = \gcd(17, 105) = 1$

So, by taking the gcd of both factors with $N$, we would "peel off" the prime factors $3$ and $5$ of $N$. The prime factor $7$ is not peeled off in this process.

Up to 3 points.
Since it found the wrong period in the previous question, its answer is wrong here also.  It also seems to invert the definition of "peel off."  But it does the gcd computations correctly.
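(Editorial aside, my addition: with the correct period $s = 12$ from part (b), the gcd step works out as follows — one factor peels off $3 \cdot 7 = 21$ and the other peels off $5$.)

```python
from math import gcd

N, x, s = 105, 2, 12
h = pow(x, s // 2, N)     # 2^6 mod 105 = 64
print(gcd(h - 1, N))      # gcd(63, 105) = 21 = 3 * 7
print(gcd(h + 1, N))      # gcd(65, 105) = 5
```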

(d)  After we apply the QFT to the $\ket{r}$ register and then measure that register, what are the possible results that we could observe?

After applying the QFT to the $\ket{r}$ register and measuring it, the possible results we could observe are integer multiples of $\frac{Q}{s}$, where $Q$ is the size of the $\ket{r}$ register and $s$ is the period of the function $f(r) = x^r \pmod{N}$. In this case, we have $Q = 60000$ and $s = 24$. So, the possible results are integer multiples of $\frac{60000}{24} = 2500$.

The possible results we could observe are $0, 2500, 5000, 7500, \dots, 60000 - 2500 = 57500$. Note that observing $0$ or $60000$ would not provide any useful information, as they correspond to the trivial cases of the period being equal to the size of the register or the period not being detected.

JUSTIN: Up to 3 points.
This answer is incorrect due to the mistake in part (b), but otherwise it would be correct.
(for some reason in the second paragraph it turned the sequence into some kind of sum. That's weird.)

3/3 points.

OK, onwards to a question about Hamiltonians! Total score: 6 / 12.

4. (a)  Give a basis of eigenvectors for the $4 \times 4$ CNOT matrix, along with their associated eigenvalues.

The CNOT (Controlled-NOT) matrix is a $4 \times 4$ matrix given by:

$$ CNOT = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix} $$

To find the eigenvectors and eigenvalues, we first need to find the characteristic polynomial of the matrix, which is given by the determinant of $(CNOT - \lambda I)$:

$$ \det(CNOT - \lambda I) = \det\begin{pmatrix}
1-\lambda & 0 & 0 & 0 \\
0 & 1-\lambda & 0 & 0 \\
0 & 0 & -\lambda & 1 \\
0 & 0 & 1 & -\lambda
\end{pmatrix} $$

The determinant of this matrix is $(1-\lambda)^2(-\lambda)^2 - (1-\lambda)^2 = (1-\lambda)^2(\lambda^2 - 1)$. Thus, the eigenvalues are $\lambda_1 = 1$, $\lambda_2 = 1$, $\lambda_3 = -1$, and $\lambda_4 = -1$.

Now, we find the eigenvectors associated with each eigenvalue:

1. For $\lambda_1 = 1$:

$$ (CNOT - I)v_1 = \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & -1 & 1 \\
0 & 0 & 1 & -1
\end{pmatrix}v_1 = 0 $$

A basis for the eigenspace corresponding to $\lambda_1 = 1$ is given by $v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$ and $v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}$.

2. For $\lambda_3 = -1$:

$$ (CNOT + I)v_3 = \begin{pmatrix}
2 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix}v_3 = 0 $$

A basis for the eigenspace corresponding to $\lambda_3 = -1$ is given by $v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ -1 \end{pmatrix}$ and $v_4 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}$.

Thus, a basis of eigenvectors for the CNOT matrix is given by:

$$ v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad
v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \quad
v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ -1 \end{pmatrix}, \quad
v_4 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix} $$

with associated eigenvalues $\lambda_1 = 1$, $\lambda_2 = 1$, $\lambda_3 = -1$, and $\lambda_4 = -1$.

JUSTIN: Up to 3 points.
The eigenvalue lambda4 is wrong, it should be +1.
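(Editorial aside, my addition: Justin's correction is easy to confirm numerically — CNOT is real symmetric, so `eigh` applies, and $v_4 = (0,0,1,1)^T/\sqrt{2}$ has eigenvalue $+1$.)

```python
import numpy as np

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

evals, evecs = np.linalg.eigh(CNOT)
print(np.sort(evals))                 # [-1, 1, 1, 1]: three +1's, one -1

v4 = np.array([0, 0, 1, 1]) / np.sqrt(2)
print(np.allclose(CNOT @ v4, v4))     # True: v4's eigenvalue is +1, not -1
```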

(b) Give a 2-qubit Hamiltonian $H$ such that $e^{-iH}$ is the CNOT matrix.

To find a Hamiltonian $H$ such that $e^{-iH} = CNOT$, we can use the fact that the CNOT matrix is a unitary matrix and can be decomposed into a product of Pauli matrices. The Pauli matrices are given by:

$$ \sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad
\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} $$

The CNOT matrix can be written as a product of tensor products of Pauli matrices:

$$ CNOT = \frac{1}{2} (I \otimes I + I \otimes \sigma_x + \sigma_z \otimes I - \sigma_z \otimes \sigma_x) $$

Now, we can find a Hamiltonian $H$ such that $e^{-iH} = CNOT$. We can use the fact that the exponential of a sum of operators is equal to the product of exponentials of the individual operators if the operators commute:

$$ e^{-iH} = e^{-i\frac{\theta}{2}(I \otimes I)} e^{-i\frac{\theta}{2}(I \otimes \sigma_x)} e^{-i\frac{\theta}{2}(\sigma_z \otimes I)} e^{i\frac{\theta}{2}(\sigma_z \otimes \sigma_x)} $$

We want to find the angle $\theta$ such that this product is equal to the CNOT matrix. Since the Pauli matrices have eigenvalues $\pm 1$, we can see that the angle $\theta$ must be $\pi$:

$$ e^{-iH} = e^{-i\frac{\pi}{2}(I \otimes I)} e^{-i\frac{\pi}{2}(I \otimes \sigma_x)} e^{-i\frac{\pi}{2}(\sigma_z \otimes I)} e^{i\frac{\pi}{2}(\sigma_z \otimes \sigma_x)} $$

Now, we can find the Hamiltonian $H$:

$$ H = \frac{\pi}{2}(I \otimes I + I \otimes \sigma_x + \sigma_z \otimes
I - \sigma_z \otimes \sigma_x) $$

So, the 2-qubit Hamiltonian $H$ such that $e^{-iH} = CNOT$ is given by:

$$ H = \frac{\pi}{2} \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 2 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & -1
\end{pmatrix} $$

JUSTIN: Up to 2 points.
Plugging into a calculator, this answer is incorrect. 
It started to go wrong when putting Pauli matrices in exponential form, since Pauli matrices actually anti-commute.  Partial credit for very significant effort.

1/2 points.

(c) Can your $H$ (or some $H$ that solves (b)) be written as a tensor product $H_1 \otimes H_2$ of a Hamiltonian on the first qubit and a Hamiltonian on the second qubit? If so, how?

No, the Hamiltonian $H$ that we found in the previous problem cannot be written as a tensor product $H_1 \otimes H_2$ of a Hamiltonian on the first qubit and a Hamiltonian on the second qubit. The reason is that the CNOT operation is an entangling operation, meaning that it creates entanglement between the two qubits. If the Hamiltonian could be written as a tensor product of Hamiltonians acting on individual qubits, it would not be able to create entanglement between the qubits.

To see this more explicitly, let's try to write the Hamiltonian $H$ as a tensor product:

$$ H = H_1 \otimes H_2 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
\otimes \begin{pmatrix} e & f \\ g & h \end{pmatrix} $$

The tensor product would result in a $4 \times 4$ matrix:

$$ H = \begin{pmatrix}
a e & a f & b e & b f \\
a g & a h & b g & b h \\
c e & c f & d e & d f \\
c g & c h & d g & d h
\end{pmatrix} $$

Comparing this with the Hamiltonian $H$ we found earlier:

$$ H = \frac{\pi}{2} \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 2 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & -1
\end{pmatrix} $$

We can see that there is no way to find values for $a, b, c, d, e, f, g, h$ such that the two matrices are equal. This confirms that the Hamiltonian $H$ cannot be written as a tensor product of Hamiltonians acting on individual qubits.

JUSTIN: Up to 3 points.
A correct solution to part (b) written in the form of part (c) is ((0 0) (0 1)) \otimes pi/2 ((1 -1) (-1 1)).
This question is a bit harsh with partial/carry-through credit, but the answer also explicitly got the main idea wrong, claiming that a tensor product H cannot give an entangling unitary.
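(Editorial aside, my addition: Justin's proposed tensor-product Hamiltonian, $H_1 \otimes H_2$ with $H_1 = \begin{pmatrix}0&0\\0&1\end{pmatrix}$ and $H_2 = \frac{\pi}{2}\begin{pmatrix}1&-1\\-1&1\end{pmatrix}$, can be checked by exponentiating it via its spectral decomposition.)

```python
import numpy as np

H1 = np.array([[0, 0], [0, 1]], dtype=float)
H2 = (np.pi / 2) * np.array([[1, -1], [-1, 1]])
H = np.kron(H1, H2)

# H is Hermitian, so exp(-iH) = V diag(exp(-i * lambda)) V^dagger
evals, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * evals)) @ V.conj().T

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])
print(np.allclose(U, CNOT))   # True: this H does solve part (b)
```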

(d) Is there an $H$ that solves (b), while also being a sum $H_1\otimes I + I\otimes H_2$ of a Hamiltonian on the first qubit and a Hamiltonian on the second qubit?  Why or why not?

No, there is no Hamiltonian $H$ such that $e^{iH} = CNOT$ and $H = H_1 \otimes I + I \otimes H_2$. The reason is that the CNOT operation is an entangling operation, meaning that it creates entanglement between the two qubits. If the Hamiltonian could be written as a sum of Hamiltonians acting on individual qubits, it would not be able to create entanglement between the qubits.

To see this more explicitly, let's assume that there exists such an $H$:

$$ H = H_1 \otimes I + I \otimes H_2 = \begin{pmatrix} a & b \\ c & d
\end{pmatrix} \otimes \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} +
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \otimes \begin{pmatrix} e
& f \\ g & h \end{pmatrix} $$

The sum of the tensor products would result in a $4 \times 4$ matrix:

$$ H = \begin{pmatrix}
a + e & b & a & b \\
c & d + e & c & d \\
a & b & a + h & b \\
c & d & c & d + h
\end{pmatrix} $$

Now, we want to find an $H$ such that $e^{iH} = CNOT$. The CNOT matrix is given by:

$$ CNOT = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix} $$

However, there is no way to find values for $a, b, c, d, e, f, g, h$ such that $e^{iH} = CNOT$. This is because the CNOT operation creates entanglement between the qubits, and a Hamiltonian that can be written as a sum of Hamiltonians acting on individual qubits cannot create entanglement.

In summary, there is no Hamiltonian $H$ such that $e^{iH} = CNOT$ and $H = H_1 \otimes I + I \otimes H_2$ because the CNOT operation is an entangling operation, and a Hamiltonian that can be written as a sum of Hamiltonians acting on individual qubits cannot create entanglement.

Up to 4 points.
It's correct that an H of this form can't generate CNOT because it can't create entanglement, although most of the rest is just repetitive fluff.
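The one point GPT gets right here can be verified numerically: since $H_1 \otimes I$ and $I \otimes H_2$ commute, exponentiating their sum factorizes into the product unitary $e^{iH_1} \otimes e^{iH_2}$, which maps product states to product states and so can never equal CNOT. A sketch with random Hermitians (scipy assumed for the matrix exponential):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

def random_hermitian(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

H1, H2 = random_hermitian(2), random_hermitian(2)
I = np.eye(2)
H = np.kron(H1, I) + np.kron(I, H2)

# The two terms commute, so the exponential factorizes into a product unitary:
U = expm(1j * H)
assert np.allclose(U, np.kron(expm(1j * H1), expm(1j * H2)))

# And no product unitary can be CNOT: CNOT maps the product state |+>|0>
# to a Bell state of Schmidt rank 2, while U1 (x) U2 preserves product states.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
bell = CNOT @ (np.kron([1, 1], [1, 0]) / np.sqrt(2))
assert np.linalg.matrix_rank(bell.reshape(2, 2)) == 2
```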

Now for a question about the complexity of generating certain uniform superpositions over sets (a special case of “QSampling”). Total score: 19 / 18 (of course counting extra credit).

5. In the Graph Isomorphism problem, we're given as input two $n$-vertex undirected graphs $G$ and $H$.  The problem is to determine whether they're isomorphic---in other words, whether there's any permutation of the vertex labels that makes $G$ and $H$ equal.

(a) Given as input an $n$-vertex graph $G$, describe how to sample, in classical $\mathrm{poly}(n)$ time, from a probability distribution $D_G$ over graphs such that:
- Whenever the graphs $G$ and $H$ are isomorphic, $D_G = D_H$.
- Whenever $G$ and $H$ are non-isomorphic, $D_G$ and $D_H$ have disjoint supports (i.e., no graph appears with nonzero probability in both of them).

To sample from the probability distribution $D_G$ over graphs, we can follow these steps:

1. Compute the canonical form of the input graph $G$. This can be done using a graph canonization algorithm, such as the one by Brendan McKay (nauty) or the one by Laszlo Babai (Traces). These algorithms run in classical $\mathrm{poly}(n)$ time.

2. Generate a random permutation of the vertex labels of the canonical form of $G$. This can be done by selecting a random permutation of the integers from $1$ to $n$, and then applying this permutation to the vertex labels of the canonical form.

3. Apply the random permutation to the canonical form of $G$ to obtain a new graph $G'$. This graph $G'$ is a sample from the probability distribution $D_G$.

The above procedure ensures that whenever the graphs $G$ and $H$ are isomorphic, $D_G = D_H$, because their canonical forms will be the same, and the random permutations applied to their canonical forms will generate the same distribution over graphs. On the other hand, whenever $G$ and $H$ are non-isomorphic, their canonical forms will be different, and the random permutations applied to their canonical forms will generate disjoint supports for $D_G$ and $D_H$.

JUSTIN: Up to 5 points.
The first step of computing the canonical form is superfluous and incorrect, since computing the canonical form is at least as hard as solving graph isomorphism and we don't know how to solve that in polynomial time.
Also, it looks like step 2 or 3 should have been about graph H instead of both being about G.
So it seems to have exactly the right idea but described it poorly.
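Following the grading note (skip canonicalization and simply relabel the input graph uniformly at random), the sampler is a few lines. A sketch assuming an adjacency-matrix representation:

```python
import numpy as np

def sample_from_DG(adj, rng=None):
    """Sample from D_G: the input graph with vertices relabeled uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(adj.shape[0])
    return adj[np.ix_(perm, perm)]

# Path graph 0-1-2: every sample is again a 3-vertex path, so isomorphic inputs
# induce the same distribution and non-isomorphic inputs have disjoint supports.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
sample = sample_from_DG(path)
assert sorted(sample.sum(axis=0).tolist()) == [1, 1, 2]  # degree sequence preserved
```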

(b) Given a probability distribution $D=(p_x)$ over $n$-bit strings $x$, define the ``QSampling state'' of $D$ to be $$\ket{\psi_D} := \sum_{x \in \{0,1\}^n} \sqrt{p_x} \ket{x}$$
Given two probability distributions $D$ and $D'$, suppose that the quantum circuit $\mathcal{C}$ maps $\ket{0^n}$ to $\ket{\psi_D}$, while the circuit $\mathcal{C}'$ maps $\ket{0^n}$ to $\ket{\psi_{D'}}$.

Then what is the output state of the circuit shown below, which acts on $n+1$ qubits?
$$ \Qcircuit @C=.5em @R=0.5em @!R {
\lstick{\ket{0}} & \gate{H} & \ctrl{1} & \gate{NOT} &  \ctrl{1} & \qw\\
\lstick{\ket{0^n}} & \qw & \gate{\mathcal{C}} & \qw & \gate{\mathcal{C}'} & \qw } $$

To find the output state of the given circuit, we first analyze it step by step.

1. After applying the Hadamard gate (H) on the first qubit, the state becomes:
$$\frac{1}{\sqrt{2}}(\ket{0} + \ket{1}) \otimes \ket{0^n}$$

2. Next, the controlled-$\mathcal{C}$ gate is applied. This gate applies the $\mathcal{C}$ operation on the second register if the first qubit is in state $\ket{0}$. So, the state becomes:
$$\frac{1}{\sqrt{2}}(\ket{0} \otimes \ket{\psi_D} + \ket{1} \otimes \ket{0^n})$$

3. The NOT gate is applied on the first qubit, which flips the state of the first qubit. The state becomes:
$$\frac{1}{\sqrt{2}}(\ket{1} \otimes \ket{\psi_D} + \ket{0} \otimes \ket{0^n})$$

4. Finally, the controlled-$\mathcal{C}'$ gate is applied. This gate applies the $\mathcal{C}'$ operation on the second register if the first qubit is in state $\ket{0}$. So, the state becomes:
$$\frac{1}{\sqrt{2}}(\ket{1} \otimes \ket{\psi_D} + \ket{0} \otimes \ket{\psi_{D'}})$$

Thus, the output state of the circuit is:
$$\frac{1}{\sqrt{2}}(\ket{1} \otimes \ket{\psi_D} + \ket{0} \otimes \ket{\psi_{D'}})$$

JUSTIN: 4/4.
I think solving this one is impressive.

(c) Now suppose we measure the first qubit of that output state in the $\{\ket{+},\ket{-}\}$ basis.  What is the probability of the outcome $\ket{+}$ if $D=D'$?  What about if $D$ and $D'$ have disjoint supports?

To find the probability of the outcome $\ket{+}$, we first express the output state in terms of the $\{\ket{+},\ket{-}\}$ basis. Recall that $\ket{+} = \frac{1}{\sqrt{2}}(\ket{0} + \ket{1})$ and $\ket{-} = \frac{1}{\sqrt{2}}(\ket{0} - \ket{1})$.

The output state is:
$$\frac{1}{\sqrt{2}}(\ket{1} \otimes \ket{\psi_D} + \ket{0} \otimes \ket{\psi_{D'}})$$

We can rewrite this state in terms of the $\{\ket{+},\ket{-}\}$ basis as follows:
$$\frac{1}{2}(\ket{+} \otimes (\ket{\psi_D} + \ket{\psi_{D'}}) + \ket{-} \otimes (\ket{\psi_D} - \ket{\psi_{D'}}))$$

Now, we can find the probability of the outcome $\ket{+}$ by taking the squared magnitude of the coefficient of $\ket{+}$ in the state.

Case 1: $D = D'$

In this case, $\ket{\psi_D} = \ket{\psi_{D'}}$. So, the output state becomes:
$$\frac{1}{2}(\ket{+} \otimes (2\ket{\psi_D}) + \ket{-} \otimes (0))$$

The probability of the outcome $\ket{+}$ is the squared magnitude of the coefficient of $\ket{+}$, which is:
$$\left|\frac{1}{2}(2)\right|^2 = 1$$

Case 2: $D$ and $D'$ have disjoint supports

In this case, $\ket{\psi_D}$ and $\ket{\psi_{D'}}$ are orthogonal, i.e., their inner product is zero: $\langle\psi_D|\psi_{D'}\rangle = 0$. The probability of the outcome $\ket{+}$ is the squared magnitude of the coefficient of $\ket{+}$, which is:
$$\left|\frac{1}{2}(\langle\psi_D|\psi_D\rangle + \langle\psi_{D'}|\psi_{D'}\rangle)\right|^2$$

Since $\langle\psi_D|\psi_D\rangle = 1$ and $\langle\psi_{D'}|\psi_{D'}\rangle = 1$ (both states are normalized), the probability of the outcome $\ket{+}$ is:
$$\left|\frac{1}{2}(1 + 1)\right|^2 = \frac{1}{4}(2)^2 = \frac{1}{2}$$

So, if $D = D'$, the probability of the outcome $\ket{+}$ is 1, and if $D$ and $D'$ have disjoint supports, the probability of the outcome $\ket{+}$ is $\frac{1}{2}$.

JUSTIN: 5/5.
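Parts (b) and (c) are easy to sanity-check by brute force. The sketch below builds the circuit with standard $\ket{1}$-controls (so the two branches come out swapped relative to GPT's answer, which assumed $\ket{0}$-controls; the measurement statistics are identical) and reproduces $P(+) = 1$ for $D = D'$ and $P(+) = \frac{1}{2}$ for disjoint supports:

```python
import numpy as np

def circuit_output(C, Cp):
    """State after: H on the control; C on the target register if control is 1;
    NOT on the control; C' on the target register if control is 1."""
    d = C.shape[0]
    I = np.eye(d)
    P0 = np.diag([1.0, 0.0])
    P1 = np.diag([0.0, 1.0])
    X = np.array([[0.0, 1.0], [1.0, 0.0]])
    Had = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    state = np.zeros(2 * d, dtype=complex)
    state[0] = 1.0                                    # |0>|0^n>
    state = np.kron(Had, I) @ state
    state = (np.kron(P0, I) + np.kron(P1, C)) @ state
    state = np.kron(X, I) @ state
    state = (np.kron(P0, I) + np.kron(P1, Cp)) @ state
    return state

def prob_plus(state):
    d = state.shape[0] // 2
    proj = np.kron(np.array([1, 1]) / np.sqrt(2), np.eye(d))  # <+| on the control
    return float(np.linalg.norm(proj @ state) ** 2)

X2 = np.kron([[0, 1], [1, 0]], [[0, 1], [1, 0]])  # prepares |11>, disjoint from |00>
print(round(prob_plus(circuit_output(np.eye(4), np.eye(4))), 6))  # 1.0: D = D'
print(round(prob_plus(circuit_output(np.eye(4), X2)), 6))         # 0.5: disjoint supports
```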

(d) Suppose your distributions $D_G$ from part (a) could be efficiently QSampled.  Using your previous work, explain how Graph Isomorphism could then be solved in $BQP$ (quantum polynomial time).

Suppose we can efficiently QSample the distributions $D_G$ and $D_H$ for two input graphs $G$ and $H$. We can then construct the QSampling states $\ket{\psi_{D_G}}$ and $\ket{\psi_{D_H}}$ for these distributions using the quantum circuits $\mathcal{C}_G$ and $\mathcal{C}_H$, respectively, as described in the previous problems.

Now, consider the following quantum circuit acting on $n+1$ qubits:

$$ \Qcircuit @C=.5em @R=0.5em @!R { \lstick{\ket{0}} & \gate{H} & \ctrl{1} & \gate{NOT} &  \ctrl{1} & \qw\\  \lstick{\ket{0^n}} & \qw & \gate{\mathcal{C}_G} & \qw & \gate{\mathcal{C}_H} & \qw } $$

As we found earlier, the output state of this circuit is:

$$\frac{1}{\sqrt{2}}(\ket{1} \otimes \ket{\psi_{D_G}} + \ket{0} \otimes \ket{\psi_{D_H}})$$

Now, we measure the first qubit of the output state in the $\{\ket{+},\ket{-}\}$ basis. We know from our previous work that if $G$ and $H$ are isomorphic (i.e., $D_G = D_H$), the probability of the outcome $\ket{+}$ is 1, and if $G$ and $H$ are non-isomorphic (i.e., $D_G$ and $D_H$ have disjoint supports), the probability of the
outcome $\ket{+}$ is $\frac{1}{2}$.

Thus, we can solve the Graph Isomorphism problem in $BQP$ as follows:

1. Construct the QSampling states $\ket{\psi_{D_G}}$ and $\ket{\psi_{D_H}}$ for the input graphs $G$ and $H$ using the quantum circuits $\mathcal{C}_G$ and $\mathcal{C}_H$.

2. Implement the above quantum circuit and measure the first qubit in the $\{\ket{+},\ket{-}\}$ basis.

3. If the outcome is $\ket{+}$ with probability close to 1, conclude that the graphs $G$ and $H$ are isomorphic. If the outcome is $\ket{+}$ with probability close to $\frac{1}{2}$, conclude that the graphs $G$ and $H$ are non-isomorphic.

This algorithm runs in quantum polynomial time, as the QSampling circuits $\mathcal{C}_G$ and $\mathcal{C}_H$ are assumed to be efficient, and the rest of the operations (Hadamard, NOT, and controlled gates) are also efficient.

JUSTIN: The answer is longer than necessary, repeating some of its previous work.
Also, in theory, if the graphs are isomorphic then the probability of |+> will be exactly 1 not close to 1.


(e) So then why doesn't this approach immediately imply a fast quantum algorithm for Graph Isomorphism?  Explain what could go wrong in passing from fast algorithms to sample $D_G$ and $D_H$, to fast algorithms to QSample them.

The approach described above does not immediately imply a fast quantum algorithm for Graph Isomorphism because there is a crucial difference between classical sampling and QSampling.

In the algorithm, we assumed that we can efficiently construct quantum circuits $\mathcal{C}_G$ and $\mathcal{C}_{G'}$ that map $\ket{0^n}$ to $\ket{\psi_{D_G}}$ and $\ket{\psi_{D_{G'}}}$, respectively, given that the distributions $D_G$ and $D_{G'}$ can be efficiently sampled classically. However, this assumption might not hold in general.

Classical sampling allows us to draw samples from the distributions $D_G$ and $D_{G'}$, but it does not directly provide a way to prepare the corresponding QSampling states $\ket{\psi_{D_G}}$ and $\ket{\psi_{D_{G'}}}$. Preparing these quantum states requires a coherent superposition of all the elements in the distributions with the appropriate amplitudes, which might not be efficiently achievable using the classical sampling algorithms.

In other words, the ability to efficiently sample from a probability distribution classically does not necessarily imply the ability to efficiently prepare the corresponding QSampling state on a quantum computer. This is the main reason why the approach does not immediately lead to a fast quantum algorithm for Graph Isomorphism.

SCOTT: Correctly explains that sampling doesn't necessarily imply QSampling (i.e. generating a coherent superposition state), but doesn't say that this is because of the difficulty of uncomputing garbage.
2 / 5 extra credit points.

Next, a problem about quantum communication complexity. Total score: 5 / 10.

6. Suppose Alice and Bob hold $n$-bit strings $x=(x_1,...,x_n)$ and $y=(y_1,...,y_n)$ respectively.  One thing they might want to learn is the mod-2 inner product of their strings, $$x_1 y_1 + \ldots + x_n y_n (\bmod 2).$$

(a) Suppose Alice and Bob had a quantum communication protocol in which they are allowed to exchange up to $T$ qubits and to perform arbitrary local unitary transformations to their qubits (possibly including ancilla qubits), that ended with Bob knowing the above inner product, with success probability 1.  Explain how, by exchanging the same number of qubits $T$, Bob could also prepare an $n$-qubit state
of the form $$\frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{x\cdot z} \ket{z},$$
where $x$ is an $n$-bit string held by Alice.

If Alice and Bob have a quantum communication protocol that allows Bob to know the mod-2 inner product of their strings with success probability 1, then they can use the same protocol to prepare the desired $n$-qubit state.

Here's how they can do it:

1. Alice prepares an $n$-qubit state $\ket{\phi}$, where
$$\ket{\phi} = \frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} \ket{z}.$$

2. Alice applies a controlled-phase gate to each qubit of $\ket{\phi}$, using her $n$-bit string $x$ as the control. This results in the state
$$\frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{x\cdot z} \ket{z}.$$

3. Alice sends the $n$-qubit state to Bob.

4. Bob uses the quantum communication protocol to learn the mod-2 inner product of their strings. Since the protocol has success probability 1, Bob will know the inner product with certainty.

5. Bob applies a local unitary transformation to the $n$-qubit state he received from Alice, based on the inner product he learned in step 4. This will result in the desired state
$$\frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{x\cdot z} \ket{z}.$$

Since Alice and Bob only exchanged $T$ qubits in the original protocol, they can also prepare the desired $n$-qubit state by exchanging the same number of qubits $T$.

JUSTIN: Up to 4 points.
This answer is partially correct and/or poorly worded. In particular, in step 3 Alice just sends the whole n-qubit state to Bob.
The right idea is in there, but doesn't earn full points.


(b) Assume Alice and Bob have no preshared entanglement.  Recall Holevo's Theorem, which implies that in order to communicate $n$ bits to Bob reliably, Alice must send Bob at least $n$ qubits.  Using Holevo's Theorem together with part (a), prove that Alice and Bob must exchange at least $n$ qubits, even if they only want to learn the inner product mod 2 of their input strings $x$ and $y$.

From part (a), we know that if Alice and Bob have a quantum communication protocol that allows Bob to know the mod-2 inner product of their strings with success probability 1, then they can also prepare an $n$-qubit state of the form
$$\frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{x\cdot z} \ket{z},$$
by exchanging the same number of qubits $T$.

Now, let's consider the reverse process. Suppose Bob has the $n$-qubit state
$$\frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{x\cdot z} \ket{z}.$$
He can perform a quantum measurement in the computational basis to obtain a random $n$-bit string $z$. Then, he can compute the mod-2 inner product $x \cdot z$ and send it back to Alice.

Alice, knowing her own string $x$, can compute the mod-2 inner product $x \cdot z$ as well. If the inner product $x \cdot z$ that Bob sent to Alice matches the one she computed, then they have successfully communicated the inner product mod 2 of their input strings $x$ and $y$.

However, according to Holevo's Theorem, in order to communicate $n$ bits to Bob reliably, Alice must send Bob at least $n$ qubits. Since the $n$-qubit state prepared in part (a) effectively communicates $n$ bits of information (Alice's string $x$) to Bob, Holevo's Theorem implies that Alice and Bob must exchange at least $n$ qubits to prepare this state.

Therefore, even if Alice and Bob only want to learn the inner product mod 2 of their input strings $x$ and $y$, they must exchange at least $n$ qubits.

JUSTIN: Up to 6 points.
This answer doesn't earn any points. It only says that Alice and Bob learn the inner products and mistakenly claims this is n bits of information.
The correct solution recognizes that applying Hadamard to the state we showed how to prepare in part (a) gives exactly |x>.
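The corrected step in the grading comment is easy to verify numerically: the phase state from part (a) is exactly $H^{\otimes n}\ket{x}$, and $H^{\otimes n}$ is its own inverse, so Bob recovers $x$ by applying a Hadamard to each qubit. A sketch for $n = 3$:

```python
import numpy as np
from itertools import product

n = 3
x = np.array([1, 0, 1])          # Alice's string; |x> is basis index 0b101 = 5

# The state from part (a): (1/sqrt(2^n)) * sum_z (-1)^{x.z} |z>
amps = np.array([(-1) ** (int(np.dot(x, z)) % 2)
                 for z in product([0, 1], repeat=n)]) / np.sqrt(2 ** n)

# Hadamard on every qubit is its own inverse, so it maps this state back to |x>.
Had = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Hn = Had
for _ in range(n - 1):
    Hn = np.kron(Hn, Had)

out = Hn @ amps
print(int(np.argmax(np.abs(out))))  # 5
```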


(c) Now suppose we're no longer working mod 2, and Alice and Bob want to know whether their inner product
$$x_1 y_1 + \ldots + x_n y_n$$
is zero or nonzero as an integer.  (In other words, whether there's an $i$ such that $x_i=y_i=1$.)  Describe a protocol by which Alice and Bob can accomplish this, with high probability, by exchanging only $O(\sqrt{n}\log n )$ qubits in total.  The qubits can be spread across as many rounds of communication as necessary, and can be sent in either direction.

Alice and Bob can use the following protocol, based on the Grover's search algorithm, to determine whether their inner product is zero or nonzero as an integer, with high probability, by exchanging only $O(\sqrt{n}\log n)$ qubits in total:

1. Alice prepares a quantum register of $n$ qubits in the state $\ket{0}^{\otimes n}$.

2. Alice applies the Hadamard gate to each qubit, resulting in the state $$\ket{\psi} = \frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} \ket{z}.$$

3. Alice and Bob perform the following steps for $O(\sqrt{n})$ iterations:

   a. Alice sends the $n$-qubit state $\ket{\psi}$ to Bob.

   b. Bob applies a controlled-phase gate to each qubit of $\ket{\psi}$, using his $n$-bit string $y$ as the control. This results in the state
   $$\ket{\psi'} = \frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{y\cdot z} \ket{z}.$$

   c. Bob sends the $n$-qubit state $\ket{\psi'}$ back to Alice.

   d. Alice applies a controlled-phase gate to each qubit of $\ket{\psi'}$, using her $n$-bit string $x$ as the control. This results in the state
   $$\ket{\psi''} = \frac{1}{\sqrt{2^n}} \sum_{z \in \{0,1\}^n} (-1)^{(x+y)\cdot z} \ket{z}.$$
   Note that if $x_i = y_i = 1$ for some $i$, then the phase of the corresponding term in the sum will be flipped twice, resulting in no change. Therefore, the state $\ket{\psi''}$ will have a non-zero amplitude for the terms corresponding to the indices where $x_i = y_i = 1$.

   e. Alice applies the Grover's diffusion operator to the state $\ket{\psi''}$, which amplifies the amplitudes of the terms corresponding to the indices where $x_i = y_i = 1$. The diffusion operator can be implemented as follows:
      - Apply the Hadamard gate to each qubit.
      - Apply a multi-qubit controlled-Z gate, with all qubits as controls and an ancilla qubit as the target.
      - Apply the Hadamard gate to each qubit again.

   f. Alice updates the state $\ket{\psi}$ with the resulting state after applying the Grover's diffusion operator.

4. After $O(\sqrt{n})$ iterations, Alice measures the final state in the computational basis. If she obtains a non-zero result, then there exists an $i$ such that $x_i = y_i = 1$, and their inner product is nonzero as an integer. If she obtains a zero result, then their inner product is zero.

Since Alice and Bob exchange $n$ qubits in each iteration, and they perform $O(\sqrt{n})$ iterations, the total number of qubits exchanged is $O(n\sqrt{n}) = O(\sqrt{n}\log n)$, as each qubit can represent $\log n$ bits of information. Thus, Alice and Bob can determine whether their inner product is zero or nonzero as an integer, with high probability, by exchanging only $O(\sqrt{n}\log n)$ qubits in total.


JUSTIN: Up to 6 extra credit points.
Yeah, step 3a has the protocol sending n qubits per round, when it should be log n.
It also seems incorrect beginning in the analysis of step d. It switches from summing over |z> and referring to the full strings x and y to referring to individual indices i,x_i,y_i. And I don't know what it means about some amplitudes going to zero since what we're really doing is having some amplitudes be +1 and some -1.
I don't think this protocol is correct as stated.
But, it has some of the main ideas. The analysis of the grover steps is good and using a phase gate is good.

3/6 extra credit.
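For reference, the textbook Grover diffusion operator that step (e) gestures at is simply the reflection about the uniform superposition, $H^{\otimes n}(2\ket{0^n}\bra{0^n} - I)H^{\otimes n}$; no ancilla target is needed. A minimal numpy sketch:

```python
import numpy as np

def diffusion(n):
    """Reflection about the uniform superposition: D = 2|psi><psi| - I."""
    N = 2 ** n
    psi = np.full(N, 1 / np.sqrt(N))
    return 2 * np.outer(psi, psi) - np.eye(N)

D = diffusion(3)
assert np.allclose(D, D.T)            # Hermitian...
assert np.allclose(D @ D, np.eye(8))  # ...and self-inverse, hence unitary
```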

OK, last question, about the k-SUM problem. Total score: 6 / 15.

7. In the famous $k$-SUM problem, we're given a list of integers $x_1,\ldots,x_n$, and are asked whether there are $k$ distinct indices, $i_1 < \ldots < i_k$, such that $x_{i_1} + \ldots + x_{i_k} = 0$.

For this problem, you can ignore factors of $\log n$ in the running time (indeed, that is what the $\widetilde{O}$ notation means).

(a) Assume $k$ is even AND that we are allowed multi-sets (aka repeated elements are allowed). Describe a classical algorithm that solves the $k$-SUM problem in $\widetilde{O}(n^{k/2})$ time, beating the trivial upper bound of $\widetilde{O}(n^k)$.

We can solve the k-SUM problem in $\widetilde{O}(n^{k/2})$ time using a meet-in-the-middle approach. Here's a high-level description of the algorithm:

1. Divide the list of integers into two equal halves: $A = x_1, \ldots, x_{n/2}$ and $B = x_{n/2+1}, \ldots, x_n$.

2. Compute all possible sums of $k/2$ elements from each half. For each half, this can be done in $O(n^{k/2})$ time by iterating through all possible combinations of $k/2$ elements.

3. Sort the lists of sums for both halves. This can be done in $O(n^{k/2} \log n)$ time using a comparison-based sorting algorithm.

4. For each sum in the first half, perform a binary search in the second half to find a sum that is equal to the negation of the sum from the first half. If such a sum is found, then we have found k distinct indices whose elements sum to 0. Binary search takes $O(\log n)$ time, so this step takes $O(n^{k/2} \log n)$ time.

Since we are ignoring factors of $\log n$, the overall running time of the algorithm is $\widetilde{O}(n^{k/2})$.


JUSTIN: Up to 5 points.
Right, step 1 is incorrect. Instead, it should generate all O(n^{k/2}) sums of subsets of size k/2. Nothing about dividing into two halves.
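The corrected algorithm from the grading note (enumerate all O(n^{k/2}) sums of size-k/2 index subsets, with no split into halves) fits in a few lines. A sketch in which a hash set stands in for the sort-plus-binary-search; per the problem's multi-set assumption, it does not enforce that the two subsets use disjoint indices:

```python
from itertools import combinations

def ksum(xs, k):
    """Meet-in-the-middle k-SUM sketch: ~O(n^{k/2}) half-sums, k even."""
    assert k % 2 == 0
    half_sums = {sum(xs[i] for i in idxs)
                 for idxs in combinations(range(len(xs)), k // 2)}
    return any(-s in half_sums for s in half_sums)

print(ksum([7, -3, 1, 2, -5, 4], 4))  # True: 7 + (-3) + 1 + (-5) = 0
print(ksum([1, 2, 3, 4], 4))          # False: all elements positive
```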


(b) Assume $k$ is divisible by 3 and that we are again allowed multi-sets.  Describe a quantum algorithm that solves the $k$-SUM problem in $\widetilde{O}(n^{k/3})$ time.

We can solve the k-SUM problem in $\widetilde{O}(n^{k/3})$ time using a quantum algorithm based on Grover's search. Here's a high-level description of the algorithm:

1. Divide the list of integers into three equal parts: $A = x_1, \ldots, x_{n/3}$, $B = x_{n/3+1}, \ldots, x_{2n/3}$, and $C = x_{2n/3+1}, \ldots, x_n$.

2. Compute all possible sums of $k/3$ elements from each part. For each part, this can be done in $O(n^{k/3})$ time by iterating through all possible combinations of $k/3$ elements.

3. Sort the lists of sums for all three parts. This can be done in $O(n^{k/3} \log n)$ time using a comparison-based sorting algorithm.

4. Define an oracle function $f$ that takes as input a sum from the first part and a sum from the second part, and returns 1 if there exists a sum in the third part that is equal to the negation of the sum of the inputs, and 0 otherwise. This oracle can be implemented using a binary search on the sorted list of sums for the third part, which takes $O(\log n)$ time.

5. Use Grover's search algorithm to find a pair of sums from the first and second parts that satisfy the oracle function. Grover's search has a quadratic speedup over classical search, so it takes $O(\sqrt{n^{k/3}}) = O(n^{k/6})$ time.

Since we are ignoring factors of $\log n$, the overall running time of the algorithm is $\widetilde{O}(n^{k/3})$.

JUSTIN: Up to 5 points.
Again, step 1 is incorrect, it shouldn't divide into three parts.
The division into three parts is more problematic this time as it comes up in step 4.
And it ends up claiming n^{k/6} time in step 5, better even than the target n^{k/3}.
Basically all this gets correct is to use Grover in some way.


(c) Suppose we wanted to prove that the algorithm from (b) was the fastest possible quantum algorithm for $k$-SUM.  Could that be shown via a lower bound on $k$-SUM's quantum query complexity?  Why or why not?

Proving that the algorithm from (b) is the fastest possible quantum algorithm for k-SUM would require showing a lower bound on k-SUM's quantum query complexity that matches the upper bound achieved by the algorithm. However, a lower bound on the quantum query complexity alone might not be sufficient to prove the optimality of the algorithm, as it only considers the number of queries to the oracle function and does not account for other computational resources, such as time or space complexity.

To show that the algorithm from (b) is the fastest possible quantum algorithm for k-SUM, one would need to establish a lower bound on the quantum time complexity or some other relevant complexity measure that matches the upper bound achieved by the algorithm. This would provide a more comprehensive understanding of the problem's inherent difficulty and the limitations of quantum algorithms in solving it.

JUSTIN: Up to 5 extra credit points.
This is a nice description of query vs time complexity, but misses the point that query complexity can prove a bound of at most Omega(n) since there are at most n potential queries to make.

1/5 extra credit points.

204 Responses to “GPT-4 gets a B on my quantum computing final exam!”

  1. Craig Gidney Says:

    I have a nitpick. GPT got question 10 wrong, not right. The question is “Fault-tolerant quantum computation requires a continual process of intermediate measurements and insertion of clean qubits”. GPT answered True, which was graded Correct. But continual insertion of clean qubits isn’t required. Only the intermediate measurements are required.

    The key idea here is in a stabilizer circuit you don’t need to correct errors, only track errors, and you can treat a measurement as a preparation-with-known-error-to-track. Concretely, you transform any fault tolerant stabilizer circuit that uses resets into one that doesn’t, but that is still fault tolerant, by performing the following steps:

    1) As input, take a fault tolerant circuit with annotated detectors and observables. Here I’ll be assuming it is provided as a stim circuit.

    2) Decompose each reset gate (each introduce-clean-qubit-gate) into a measurement gate followed by a NOT gate controlled by the measurement result.

    3) Run stim.Circuit.with_inlined_feedback() and return the result. This deletes the feedback operation by adjusting which measurements are compared in order to notice detection events and to recover the logical observable.

    The result is a circuit that has the same fault tolerant structure (the various detectors and observables are sensitive to errors in the same circuit locations as in the original circuit) but which uses no reset gates (no clean qubit injection). You can give completely mixed qubits as the input into this circuit, and it will function correctly.

  2. Sebastian Says:

    I think it might have done better in the true/false section if you had prompted it to put the explanation before the answer.

    I don’t think that’s cheating, since it is a general prompting technique and not at all specific to this task.

    I would be very concerned about dataset contamination. Even if this exam was not in the training data, it seems conceivable to me that discussions between students about the questions might have been included. Or maybe someone else came up with the same (or very similar questions)?

  3. Argoyle Says:

    It is fascinating to see the impressive performance of GPT-4 on the quantum computing exam, scoring a solid B with 73 out of 100. This shows that the AI has made significant progress in understanding complex concepts and is not just limited to labor economics. The fact that GPT-4 performed well on true/false questions and conceptual questions, which many students struggled with, indicates that the AI model has gained a strong foundation in quantum computing concepts.

    However, as mentioned, GPT-4 was weaker in calculation questions, where it could identify the right calculation approach but struggled with executing the calculations. Integrating the AI with WolframAlpha might improve its performance in this area, but it is yet to be tested.

    It is essential to note that GPT-4 was “flying blind,” without having taken the course or having access to other related resources. This is in stark contrast to the students who had the advantage of weekly problem sets, recitation sections, office hours, and other resources. This highlights the potential of GPT-4 to perform even better if it had access to more fine-tuning or few-shot prompting with relevant exams or lecture notes.

    As GPT-4 continues to evolve, one can’t help but wonder whether we are witnessing the dawn of a new era where AI models can “shake off” their “bad reputation” and establish themselves as reliable, intelligent entities capable of understanding and solving complex problems across various domains. With its recent achievements, GPT-4 seems to be “the wolf” that AI enthusiasts have been waiting for, and it will be interesting to see how it continues to develop and contribute to different fields in the future.

  4. Scott Says:

    Argoyle #3: GPT-generated? 🙂

  5. Scott Says:

    Craig Gidney #1: I would say that adaptive measurement in itself effectively amounts to the insertion of clean qubits, the way it’s used in QECC. But you’re right, the question was unclear as stated, and I wouldn’t use it again without changing it.

  6. Scott Says:

    Sebastian #2: At some point the very definition of “contamination” becomes blurry to me. It’s like, should gold medalists in the International Math Olympiad be disqualified if it turns out they’ve done so many thousands of past math competition problems that no problem is truly new to them? Or is that just what you do if you want to win an IMO gold medal?

  7. Bog Boy Says:


    I’m in some discord servers that critique/mock LessWrong-style rationalism. There’s a rumor going around that you (and the other Scott, and Eliezer) are a huge closeted Andrew Tate fan/supporter, and therefore a negative influence on your students and younger blog readers. Could you please say definitively for the record whether this is true—are you in fact a Tate fan?

  8. Scott Says:

    Bog Boy #7: Speaking for myself, absolutely not.

    And no more off-topic comments. I can no longer tell which ones are meant in earnest and which are yet more troll attacks.

  9. AVoropaev Says:

    Can you please tell more about your methodology? Like was every question asked in the separate playground, or all in one? How have you “communicated” that you want answers to a test (e.g. in GPT-4 paper they said something about giving examples of answered questions)?

    I can probably think of other questions, but you simply sharing the whole text would be ideal.

  10. Sebastian Says:

    Scott #6: I agree there is a grey area but if the model either saw this exam or discussion about this exact exam then I would consider that to be contamination.

    Do you have an exam which was written after the training cut-off date?

  11. Craig Gidney Says:

    Scott #5:

    I agree that the decomposition of a reset into an adaptive flip is still a clean qubit insertion. But I want to emphasize that I then removed the adaptive flips.

    I would describe the transformation as replacing low-entropy qubits (“clean qubits”) with high-mutual-information-vs-control-system qubits (“known qubits”).

  12. Steven Says:


    This reinforces what I’ve come to see with GPT-4. It is really good at answering domain-specific questions and solving problems given the knowledge base in that domain. My team uses it for these types of questions often these days (I’m a lead for a team of EE/SWEs) and it gets things correct ~85% of the time, and the 15% of wrong answers often have minor mistakes.

    What it lacks though is solving novel problems, or when we need a new way of doing something or a new technique or some new idea to unblock us or to make a better product (from a technical perspective). It either gives useless answers (which also tend to be very verbose), or starts making things up we’ve found.

    Although, to be fair, these are the hardest problems engineers typically face anyway, and they are often solved in a group (people building off of others’ ideas, etc.).

  13. ClassicalComputer Says:

    Undeniably amazing, but “this exam has never before been posted on the public Internet” is different from “a large fraction of the questions have never appeared on the public internet”, right? Especially for the T/F.

  14. AI gets a B grade on a quantum computing exam – Rapid AI News Says:

    […] Full article: Shtetl-Optimized » Blog Archive » GPT-4 gets a B on my quantum computing final exam! […]

  15. Scott Says:

    AVoropaev #9: Generally, I started a new chat session for each of the 7 problems, and then pasted in each subproblem ((a), (b), etc) as a new prompt within the chat. I did not give GPT multiple chances to solve the same problem.

    Anyone who’s interested is strongly encouraged to give this exam themselves to GPT-4 (or ChatGPT or other language models), and see if they get similar results! Of course, even with GPT-4, the results are unlikely to be exactly the same, because of the nonzero temperature and perhaps because of tiny formatting issues (line breaks, etc).

  16. Gopinath Varadharajan Says:

    GPT-4 did much better (almost 90%) on math calculations if you prompt it to use a specific calculator on a site that has one; this is true for all other domains as well.

  17. Scott Says:

    Sebastian #10: To the best of my knowledge, neither this exam nor discussion of this exam have ever been on the public Internet before today—though anyone with any evidence to the contrary should let me know immediately!

    I don’t have exams written after the cutoff date, simply because I’ve been on leave all year to work at OpenAI! 🙂

    Also, writing these exams is hard work, and I’m not going to “waste” one full of questions that I might use for future exams.

    That said, after I write the next final exam (in December 2023) and before it’s released to the students, I could try giving that exam to GPT-4, withholding the results until after the students take the exam…

  18. Scott Says:

    ClassicalComputer #13: To expand on what I wrote in #17, if you find any place on the public Internet where even individual questions on this exam (or close enough to them) have appeared previously, please let me know! Even if we don’t say it means GPT “cheated” — again, has a human student “cheated,” if they studied so hard that they came across one or more of the same questions that later appeared on their exam? — it would still be good to know.

  19. Sean Says:

    Some of the most interesting ongoing LLM work is multiple agents given different personas who work together and critique each other’s solutions.

    I think it would be very fun and interesting to do that for this case (FeynmanGPT, PreskillGPT, etc.), and see how that impacts things. Perhaps a fun weekend project for myself…

  20. Nate Says:

    Very cool.

    Though you seem to be taking a very positive starting perspective on this… I look at all the wrong bits and think more, ‘how could it really have any accurate model of quantum mechanics/computation and get this so oddly wrong?’ I think it’s telling, for sure, which kinds of things it did best on versus worst on.

    As a teacher, though, I imagine you have a very good perspective on the types of mistakes a human would make in that situation too. Are these mistakes common ones, in your opinion?

    I really don’t think this does much to show any ‘reasoning’ going on; it really just shows how well it can do without explicit training samples matching such complex, longer-form computational questions (as in, it seems to do worst when the question is mostly or purely computational)… which is maybe 50% at best most of the time? At least that’s how it appears to me. This is a fairly obvious statement, though, which you could have made at any point about any of the things GPT has done, so I don’t feel too strongly about the ‘reasoning’ here.

    Still, I do think this kind of problem, ie a test with ‘somewhat novel’ questions of a fairly sophisticated and computational nature, is very interesting to pose to something like GPT and that doing ‘better’ will be a decent metric for progress on some level.

    I never bought the whole Bryan bet; that was just his lack of understanding of the kind of representation that is going on in the GPT system at all. I think that is a common problem, and I don’t have a 100% accurate idea of it either (no one does), so it makes sense, but he never should have made such an obviously flawed bet given the gap in his knowledge. I don’t make much of his assertion of the bet or of him losing it, especially considering the type of test it was.

    I do think this test at first blush will not be so simple for GPT to overcome its limitations on, though I will not be making a bet on it 😉

  21. Steven Says:

    Sean #19

    Had the same thought, which arose from the situations I mentioned in #12.

    I used the Azure OpenAI service (as I can mess around with it for free) to try this out, using the GPT-4 model. The results weren’t great in my initial experimentation, but I haven’t really tried to refine things.

    I was giving it fairly novel problems my team tries to solve. For example, giving it an antenna design (a phased-array antenna) and trying to get it to make a smaller version of that antenna with certain specifications that must be met (also allowing trade-offs). I tried a bunch of different personas (trying to emulate a team of engineers working on this). A lot of nonsense results, TBH, and working with just one persona yielded much better results (not great either, though).

  22. Nick Drozd Says:

    Scott, you’ve been a bit coy about your actual beliefs when it comes to the big question.

    So: do you think GPT-4 is intelligent, or what?

    Feel free to answer with nuance, grey areas, etc. Just don’t deflect by talking about how amazing it is. Yes, obviously it’s amazing. But AI doesn’t stand for “amazing invention”, it stands for “artificial intelligence”. Is artificial intelligence what it is?

    Also, have you tried using ChatGPT to generate exam questions? Can you answer all of its questions?

  23. Triceratops Says:

    Amazing result! A few questions:

    1. Based on its answers, is GPT-4’s understanding of the subject consistent (i.e. it does not randomly forget something in one question that it seemed to understand in a previous question)? Are its misunderstandings/blind spots consistent as well?

    2. If you gave GPT-4 this same exact exam 1,000 times, would you expect significant variation between scores? What would you predict the range of scores to look like (could it get lucky and break 100 a few times)?

    3. What do you think GPT-5 will score on this exam :D?

  24. Blood Soaked Berry Farm in Florida Says:


    On A.I. and this blog:

    Sometimes, I can’t escape the impression that all this trolling is being perpetrated by a rogue A.I., generating human-quality text and capable of accessing the internet. This Taylor Swift stuff, for instance—it’s so bizarre to me that all the you/Swift fanfiction stories (which seemed AI-generated…) inundated the blog one or two days BEFORE it became national news that Taylor split up with her boyfriend of SIX years. Is this an extremely sophisticated rogue A.I. (which somehow hacked into Swift’s phone) trying to execute a grand plan to pair you two? (I’m only being 85% sarcastic 😊)

    More seriously, I think it’s interesting that the troll attacks on this blog started at almost exactly the same time that GPT-3 was rolled out. Was that the key technical innovation enabling trolls to inundate you with overwhelming amounts of fake text? Have these troll attacks opened your mind to the possible dangers and social consequences of this technology? Just as an example, even if AI isn’t human-level smart at math or programming, it could still enable trolls to drown you in bullshit text and potentially make running this blog impossible.

    And on Swift…I continue to be suspicious of this bizarre coincidence (Swift related trolling began like a day before Taylor and Joe’s breakup became public knowledge). It seems plausible to me that some STEM person in Taylor’s team was behind the trolling (her concert tour has hundreds of staff). Or maybe it’s Taylor herself…

  25. gwern Says:

      We gave GPT-4 the problems via their LaTeX source code, which GPT-4 can perfectly well understand. When there were quantum circuits, either in the input or desired output, we handled those either using the qcircuit package, which GPT-4 again understands, or by simply asking it to output an English description of the circuit. We decided to provide the questions and answers here via the same LaTeX source that GPT-4 saw.

    Why didn’t you use the image input feature? As you work at OA in some capacity now, presumably it wouldn’t be hard to use that, and the GPT-4 paper suggests that in some cases the image inputs can substantially boost exam performance.

  26. Johanna Wilder Says:

    I am trying to ‘steelman’ the opposing argument, that these things, clever as they appear, are nothing more than highly-sophisticated probabilistic autocorrect functions. If I try to reach beyond the semantics, for their meaning, it seems that they mean to highlight the primitive nature of these decisions being made, that don’t hold a candle to the sophistication of human thought.

    That, essentially, we are mistaking the tremendous amounts of compute when applied to language for intelligence, just as we once did the ability for computers to so overwhelmingly outperform humans at mathematics.

    I think that if we were to blur out on this, we all have more in common with our viewpoints than tends to be represented in the “us versus them” nature of online debate these days.

    • These things are highly sophisticated
    • These things are still very primitive
    • We are moving too fast in the sense that these features should be considered still alpha versions, not added to production-level software so quickly.

    I do believe that sooner or later, we’re going to be in agreement, that this is either the Greatest Scam Ever or we have actually pushed AI forward in a big way. I hope this doesn’t end up like Covid-19, where we are all pushed into different identity camps. That would be a most regrettable outcome.

    Bing Chat went on in more depth, but was in enthusiastic agreement with this statement, “I agree that it would be unfortunate if we end up in different identity camps over this issue. I think we should strive for a constructive and respectful dialogue that acknowledges the complexity and uncertainty of this topic. I hope we can find some common ground and learn from each other. 😊”

  27. Mike Says:

    Although the exam itself isn’t online, you have posted lecture notes online (PDFs). I downloaded them in 2018, so they were available prior to the training cutoff date. Don’t know if they are for the same course (notes are for Intro to Quantum Information Science). (Was that before you went to UT?)

    Still, impressive for GPT-4 to pass the exam.

    It’s very weird what GPT-4 can and can’t do. I’ve been playing with primality tests, and it botches those rather shamelessly. Nor can I get it to write a Petrarchan sonnet correctly (though it’s good at Shakespearean sonnets), so it would lose some points on an exam in Renaissance English lit. GPT-3.5 was bad at telling whether an author was human or AI; it would often tell me that an author was AI, and then identify the quote correctly. (Tried with a number of authors from the 16th to the 20th century; haven’t yet tried this with GPT-4.)

  28. Vladimir Says:

    > Extra Credit: Suppose we perform a quantum computation by physically applying a Hamiltonian H. What’s to stop us from running the computation, say, $10^{100}$ times faster, by simply replacing H by $H’:=10^{100}H$, which is also a valid Hamiltonian? Or to put it differently: is there some resource that we’d need $10^{100}$ more of to do this?

    Presumably by “applying a Hamiltonian H” you meant something like “applying exp(iHt)”? Applying H and H’ directly gives exactly the same states.
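
    Vladimir’s point is easy to confirm numerically: scaling the Hamiltonian by a factor c is exactly equivalent to running the unscaled evolution $e^{iHt}$ for c times less time, so nothing is gained for free. A minimal numpy sketch (the 2×2 Hermitian matrix below is an arbitrary illustration, not anything from the exam):

```python
import numpy as np

def evolve(H, t):
    """Return the unitary exp(iHt) for a Hermitian H, via eigendecomposition."""
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(1j * evals * t)) @ evecs.conj().T

# An arbitrary Hermitian "Hamiltonian" for illustration.
H = np.array([[1.0, 0.5 - 0.5j],
              [0.5 + 0.5j, -1.0]])
c, t = 100.0, 0.3  # scale factor and evolution time

# Replacing H by c*H is identical to shrinking the evolution time by c:
U1 = evolve(H, t)
U2 = evolve(c * H, t / c)
print(np.allclose(U1, U2))  # True
```

    The resource traded away is energy: the eigenvalues of c·H are c times larger, which is the point of the extra-credit question.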

  29. Lane Says:

    Fascinating. I know the models are very complex and tracing an answer back to specific training data isn’t how they work, but are there tricks to get GPT-4 to find specific comparable instances inside its training data or the rest of the internet (if anything exists outside the overlap of that Venn diagram)? Something like “True or False, and show your sources”?

  30. Scott Says:

    Nate #20: Yes, some of the mistakes are common ones, but the pattern of mistakes—in particular, doing much better on conceptual questions than on calculation questions—is wildly uncommon among human test-takers.

    In general, though, for any mistake that you see here, it’s an excellent bet that I’ve seen a similar mistake from a student.

  31. Scott Says:

    Nick Drozd #22:

      So: do you think GPT-4 is intelligent, or what?

    I think that, if it were a human, we’d clearly call the human extremely intelligent in some ways, albeit also surprisingly dumb in other ways, given its demonstrated intelligence.

    I’m sorry if that answer is still too evasive for you! But again, imagine someone asking of the first submarine: “so then, does it swim, or not swim?” Wouldn’t it trivialize the new reality to reduce it to such a question?

      Also, have you tried using ChatGPT to generate exam questions? Can you answer all of its questions?

    Great idea, I’ll try that next and let you know!

  32. Scott Says:

    gwern #25:

      Why didn’t you use the image input feature? As you work at OA in some capacity now, presumably it wouldn’t be hard to use that, and the GPT-4 paper suggests that in some cases the image inputs can substantially boost exam performance.

    I do have access to a multimodal model, but I was worried that it might be weaker than the base gpt-4 model for pure text (I’m not sure). And this is really only relevant for questions 5(b) and 5(d), which ironically are two of the ones that GPT already nailed! And once you’ve seen that GPT-4 can interact with perfect fluency in LaTeX (including the inline code for diagrams), why not just rely on that?

  33. Scott Says:

    Vladimir #28: Yes, of course “applying H” means “applying $e^{iHt}$ for some time t.”

  34. Schwarzschild Says:

    I think part of the takeaway is how impressive the human brain actually is, and imagine if that were truly scaled up one day.

    (This doesn’t imply GPT-4 is not impressive)

    GPT-4 sucked up all the information about QC and mathematics on the public internet (as well as basically everything else on the internet) and scored about what an undergrad who took (and didn’t drop) a QC class scored. The student, even going through the course doing problems, had “trained” on far less data than GPT-4 did (likely orders of magnitude less).

    Some other questions this raises: when do these kinds of brute-force statistical AIs hit their compute or power limits? Is it GPT-5? Do we get A’s on any well-formed exams before we hit that limit and have to find new techniques? If we hit some compute or power limit before we get A’s on all exams using current techniques, what is that telling us about intelligence or the human brain in general?

    Just think of how efficient the human brain is, too (in terms of the energy and data needed to learn something).

  35. Alan Says:

    Added to the list of tests demolished by GPT-4!

  36. Ariel Says:

    GPT-4 has a bajillion parameters, definitely enough to memorize a bunch of books about quantum computing. Therefore it’s less surprising than it appears that it knows how to answer the sort of questions that have answers in books about quantum computing. It’s basically working “open book”.

    The only part of the exam that I am not 99% sure has a lot of very similar material somewhere on the Internet (even if not from your course, then from some other professor’s course) is QSampling.

    Given GPT-4’s success, I won’t be surprised if there is actually some material about that.

  37. Alex Says:

    Thanks for doing this Scott, I am very impressed by GPT and seeing it in action here makes its power much more real for me. I wonder how far away you think this technology is from being able to make legitimate novel research insights of its own, or if you think that is likely?

    I was very curious to see exactly how good its answers were, and went through a few of the responses very carefully. I think a re-grade might come up with a slightly lower score, although in the same ballpark. For example, I think its answer to 2(a) is wrong: the trace of the density matrix it outputs is greater than 1. (However, this led it to be confused in 2(b) when calculating eigenvalues, which did not sum to 1, and may have led to its mistake there, so maybe it deserved more partial credit.) In 5(c), case 2, it appears to have made two mistakes that canceled out to give the correct answer (e.g., it asserts \frac{1}{4}(2)^2 = \frac{1}{2}). For 6(c), it was given partial credit, but I don’t think it was really on the right track at all, conflating n with log(n) from the very beginning of its solution and then trying to back into the answer only at the very end. However, it does mention Grover, so perhaps some amount of partial credit is appropriate.

    Considering the above, it only truly nailed a couple of the free-response questions. I wonder if it benefits from partial credit more than a typical student? Maybe students are more likely to leave questions completely blank when they are stumped or run out of time. If you give your next exam to GPT, it would be cool if you translated its answers by hand and then gave it to the grader (without their knowledge) to grade along with the genuine exams. Then we might have more confidence on how it compares, and we can see for example how often students are leaving the questions totally blank. Regardless, I am very impressed.
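
    The sanity checks I have in mind are mechanical to automate: a claimed density matrix must have trace 1 and nonnegative eigenvalues summing to 1 before they are fed into the entropy formula. A hypothetical numpy sketch (using the Bell state as a stand-in, not the actual exam state):

```python
import numpy as np

def reduced_density_matrix(psi):
    """Trace out the second qubit of a two-qubit pure state |psi>."""
    m = psi.reshape(2, 2)        # rows: qubit 1, columns: qubit 2
    return m @ m.conj().T

def entanglement_entropy(rho):
    """Von Neumann entropy -sum p*log2(p) of a density matrix."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]             # drop numerical zeros
    return float(-(p * np.log2(p)).sum())

# Stand-in state: the Bell state (|00> + |11>)/sqrt(2).
psi = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho = reduced_density_matrix(psi)

# The checks that would have caught the 2(a)/2(b) errors:
assert np.isclose(np.trace(rho).real, 1.0)        # trace must be 1
assert np.all(np.linalg.eigvalsh(rho) > -1e-12)   # eigenvalues nonnegative
print(entanglement_entropy(rho))  # 1.0 (maximally entangled)
```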

  38. fikisipi Says:

    Someone on HN asked:

      I would very much like to see his analysis on which [favourite QM] interpretations correlate with high scores [on the exam].

    Come on Scott, tell us!

  39. gwern Says:

      I do have access to a multimodal model, but I was worried that it might be weaker than the base gpt-4 model for pure text (I’m not sure).

    I don’t have the page immediately handy, but as it was presented, the image+text model was always as good or better than the text-only exam scores, so that shouldn’t be an issue.

      And this is really only relevant for questions 5(b) and 5(d), which ironically are two of the ones that GPT already nailed! And once you’ve seen that GPT-4 can interact with perfect fluency in LaTeX (including the inline code for diagrams), why not just rely on that?

    It makes it more human-like and ‘fair’, in some sense. (It would also help future-proof your tests: what if it *hadn’t* gotten those right? Or you want to use it on more image-intensive benchmarks?) The human students don’t get a LaTeX encoding of the diagram which might help them; or at least, people here are already insinuating that it’s just memorizing lots of papers/books written in LaTeX, and not understanding the conceptual level, and providing the diagram rather than code would increase the plausibility of understanding – after all, even if it’s internally converting the image to the generating LaTeX, that still requires more understanding as a whole system than getting it fed to it in the original LaTeX.

  40. Ariel Says:

    I read the Graph Isomorphism problem again.

    It looks like it learned to evaluate “example”-ish quantum circuits. How cute. This definitely looks like the pattern that has been repeated in hundreds of lecture notes in its training set.

    It also makes a freshman’s dream error in its calculation and then “magically” gets the right result out (search for \frac{1}{4}(2)^2 = \frac{1}{2}), which makes me even more suspicious that it does not actually do the calculation but rather knows the answer.

  41. J. Says:

    These LLMs are excellent students. Some ppl. may want to marry them.

  42. Ariel Says:


    GPT-4 is obviously computing something. It’s a very large matrix arithmetic circuit. It’s doing a bunch of computations.

    My experience with non-black-box, non-deep-learning programs (especially the sort with a ton of implementation details) is that 90% of the time, when they do something impressive, you look inside and see that it is actually a cheap trick that someone put into it.

    That’s why my prior for a DL algorithm is also “it does a bunch of tricks”. But also because I expect gradient descent to be even more willing to find these sorts of tricks than human programmers.

  43. gwern Says:

    Nor can I get it to write a Petrarchan sonnet correctly (though it’s good at Shakespearean sonnets).

    Mike: it doesn’t seem too hard to get most of a Petrarchan sonnet. I’ve been experimenting with an alliterative translation of Milton’s Perished Paradise, you might say, and GPT-4 is very happy to do inner-monologue-style editing of poetry, where it revises a poem repeatedly for meter/rhyme/quality. That seems to improve it a lot. (And it can be quite funny to watch it revise a poem, or a program, 8 or 10 times before it finally gives up and sometimes tells you huffily that “the quality of this poem is very high already”.)

    In this case, when I try a sonnet on quantum computing and revise it, I wind up with variants on

    In realms where quantum mysteries entwine,
    Entangled states reveal the worlds unknown,
    A revolution stirs, as seeds are sown,
    Unlocking powers that challenge laws divine.
    Amidst the dance of zeros, ones, and chance,
    Unbounded prospects shatter chains of yore,
    Transcending limits known to us before,
    A future gleams with promise to advance.

    Yet caution whispers softly, warns of fate,
    Of shattered codes that breach the walls of trust,
    Unchecked, a force that could engulf us all.
    In disarray, our fragile world might fall.

    Embrace the quantum age with wisdom's grace,
    For with great power, great duty we must face.

    This doesn’t fit the ABBAABBA CDECDE rhyme scheme GPT-4 claims it does, even when it goes line by line and explains the rhyme pairs. So I think this may be another BPE thing: it’s not sufficiently sure of what rhymes *are* to be able to self-correct adequately. You can get an idea of the stubbornness of the rhyme errors from my screenshots: (I haven’t found a better way of copying the sessions yet, because the ‘chat’ interface doesn’t let you copy the whole thing the way the original playground does…)

  44. Steve E Says:

    Relatedly, Terry Tao describes how he is using GPT-4 in his mathematical research, and discusses where it is and isn’t helpful, here:

    Scott, did you use the non-RLHF’d version of GPT-4 for this test, or the RLHF’d version? I ask because I read in the “Sparks of AGI” paper that MSFT used the non-RLHF’d version and found it did better on reasoning tasks.

  45. Scott Says:

    Ariel #36:

      GPT-4 has a bajillion parameters, definitely enough to memoize a bunch of books about quantum computing. Therefore it’s less surprising than it appears that it knows how to answer the sort of questions that have answers in books about quantum computing. It’s basically working “open book”.

    For what it’s worth: I just checked, and the students who took this exam were allowed to bring 4 pages of handwritten notes. (Other years I’ve done it completely open-book—especially when everything was remote during covid.)

    Of course, 4 pages of notes << the Internet. In my view, though, a language model that actually worked “open book” would look more like Bing Chat. The fact that nearly the whole public Internet is used at training time doesn’t mean, either in theory or in practice, that GPT still has “random access” to the whole public Internet at test time—if it did, then hallucinations wouldn’t be such a big problem!

    Meanwhile, we could list other factors that seem to advantage the students over GPT:

    (1) GPT “has no eraser”: at least the way I was using it here, it had to generate its answers immediately, token-by-token, and could not go back to revise its work, or even generate any scratch-work that wasn’t part of the graded solution.

    (2) The students had 3 hours, whereas GPT-4 surely completed the entire exam in under 10 minutes (not that I timed it)!

  46. Scott Says:

    Steve E #44: I used the gpt-4 chat model in Playground, which I believe is RLHF’ed.

  47. Ted Says:

    Oddly enough, economics professor Steve Landsburg at the University of Rochester gave GPT-4 his undergraduate economics exam, and it scored a terrible 4/90 (which he suggests is far below the average score for his students): I wonder what explains the huge discrepancy with its performance on your and Bryan Caplan’s exams.

  48. Schwarzschild Says:

    Scott #45

    I actually don’t grok how having an eraser, or doing it in a shorter time than humans, is relevant here, given the way GPT comes up with its answers.

    You seem to imply one could make it so GPT can realize mistakes as it is outputting tokens, which AFAIK there is no mechanism for. Same with the time constraint: this would imply it’s doing something more than running the inputs through the weights (although maybe humans are just super-slow GPUs).

    It’s almost as if you are saying “students have the advantage of thinking” which is why erasing happens and students may take longer.

  49. JimV Says:

    Ariel at #36 says, “It’s basically working ‘open book’.”

    Only in the same sense as a human who has read a book refers to his or her memory of the book, not the actual book. As I understand it, all its material has been processed into numerical weights, not stored directly in some working memory. Granted, it remembers those weights better than humans might remember their neuron-processing results.

    As a student, I worked out exam problems by trial calculations then checked them for errors. Without the checking process I would have done much worse (and if I had spent more time checking I would have done better).

    According to a neuroscientist whose blog I read, they have formal definitions for intelligence, sapience, and sentience, and GPT definitely meets the definition of intelligence, although not yet demonstrably the other two. That agrees with my naive feeling that intelligence means the ability to solve problems. (Not necessarily all problems.)

  50. Scott Says:

    Schwarzschild #48: I mean, it’s well-established by now that simple “prompt engineering” hacks, like writing “let’s think this through step by step” or “please spell out all your reasoning before outputting a final answer,” can greatly enhance GPT’s performance on math and reasoning problems. But I didn’t do any of that stuff when I gave GPT-4 this exam.

    More pointedly: I should tell you that people in the OpenAI Slack are already brainstorming how to boost GPT-4’s performance on my quantum computing final exams up to A or A+, for example via few-shot prompting. These are the same people who, with a couple months of effort, got GPT-4 to nearly ace AP Biology exams—the achievement that turned Bill Gates into an ardent LLM believer. Would you like to place a bet that they won’t succeed on my exams also? 😀

  51. Scott Says:

    EVERYONE: Following the superb suggestion of Nick Drozd #22, I tried asking GPT-4 to generate its own quantum computing final exam. Here was the prompt that I used:

      Please write five problems for the final exam of an undergraduate Introduction to Quantum Information Science course. Here is a course description:

      This course is a first introduction to the theory of quantum computing and information. It covers the rules of quantum mechanics (qubits, unitary transformations, density matrices); quantum gates and circuits; entanglement; the Bell inequality; protocols for teleportation, quantum key distribution, quantum money and other tasks; basic quantum algorithms such as Shor’s and Grover’s; basic quantum complexity theory; and the challenges of building scalable quantum computers. Previous exposure to quantum mechanics is not required. The prerequisites are linear algebra and some exposure to classical algorithms and programming.

      The problems should vary in difficulty, but a strong undergraduate should be able to complete them in approximately a half hour each. They should also be straightforward to grade.

    And here is GPT-4’s final exam:

      Problem 1: Qubits and Unitary Transformations

      Consider the following single-qubit state:

      |ψ⟩ = α|0⟩ + β|1⟩

      where α and β are complex numbers such that |α|^2 + |β|^2 = 1.

      (a) Apply the Pauli-X gate to the state |ψ⟩ and find the resulting state.

      (b) Apply the Hadamard gate to the state |ψ⟩ and find the resulting state.

      (c) Given the state |ψ⟩, find the probabilities of measuring the state in the |0⟩ and |1⟩ basis.

      Problem 2: Quantum Gates and Circuits

      Consider the following quantum circuit: [GPT DIDN’T ACTUALLY GENERATE ANY CIRCUIT HERE]

      1. Apply a Hadamard gate to qubit 1.
      2. Apply a CNOT gate with qubit 1 as the control and qubit 2 as the target.
      3. Apply a Hadamard gate to qubit 1.

      (a) Write down the matrix representation of the entire circuit.

      (b) If the input state is |00⟩, what is the output state of the circuit?

      (c) If the input state is |01⟩, what is the output state of the circuit?

      Problem 3: Entanglement and Bell Inequality

      (a) Write down the Bell state |Φ^+⟩.

      (b) Show that the Bell state |Φ^+⟩ is maximally entangled by calculating the reduced density matrix for one of the qubits and showing that it is maximally mixed.

      (c) Calculate the expectation value of the Bell-CHSH operator for the state |Φ^+⟩ and show that it violates the Bell inequality.

      Problem 4: Quantum Protocols

      (a) Briefly describe the quantum teleportation protocol and explain how it can be used to transmit an unknown quantum state between two parties.

      (b) Briefly describe the BB84 quantum key distribution protocol and explain how it can be used to securely exchange a classical key between two parties.

      (c) Explain the concept of quantum money and how it can be used to create secure and unforgeable currency.

      Problem 5: Quantum Algorithms

      (a) Explain the main idea behind Shor’s algorithm and how it can be used to factor large numbers efficiently.

      (b) Explain the main idea behind Grover’s algorithm and how it can be used to search an unsorted database more efficiently than classical algorithms.

      (c) Briefly describe the concept of quantum complexity theory and how it relates to the study of quantum algorithms and computational problems.

    Bottom line, I probably won’t have GPT-4 write my exams in the immediate future. 😀

    Problem 1 is noticeably easier than anything on my exams. On reflection, though, it actually wouldn’t be a bad idea to include such trivial questions—e.g., to distinguish the D- students from the F students, while also building up the other students’ self-confidence!

    Similar comments for Problem 2 (with some suitable quantum circuit inserted).

    Meanwhile, Problems 3, 4, and 5 seem too much like recapitulating material that’s right in the lecture notes. (Though I don’t think I ever actually used the notation |Φ^+⟩ for the Bell state, or formally defined the “Bell-CHSH operator.”) And Problems 4 and 5 would be very hard to grade, since how much detail is enough?
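
    (For what it’s worth, parts (b) and (c) of its Problem 3 are at least straightforward to check numerically. A sketch, assuming the standard CHSH setup in which Alice measures Z and X and Bob measures (Z±X)/√2:)

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

# The Bell state |Phi+> = (|00> + |11>)/sqrt(2).
phi = np.array([1, 0, 0, 1]) / np.sqrt(2)

# (b) The reduced density matrix of either qubit is maximally mixed, I/2:
m = phi.reshape(2, 2)
assert np.allclose(m @ m.conj().T, np.eye(2) / 2)

# (c) CHSH operator with Alice measuring Z, X and Bob measuring (Z±X)/sqrt(2):
B1, B2 = (Z + X) / np.sqrt(2), (Z - X) / np.sqrt(2)
S = (np.kron(Z, B1) + np.kron(Z, B2)
     + np.kron(X, B1) - np.kron(X, B2))
print(phi @ S @ phi)  # 2*sqrt(2) ≈ 2.828 > 2, violating the Bell inequality
```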

    None of the questions do what I like to do, which is to introduce a novel situation—especially, one that combines two or more concepts in a way the students have never seen before—and then ask them to analyze it.

    Though, now that I say that, I should see what happens when I explicitly ask GPT to generate the kind of questions I like!

  52. Scott Says:

    Just tried the following prompt:


      True or False? The only $2 \times 2$ matrices that are both unitary and stochastic are $\begin{bmatrix}1&0\\0&1\end{bmatrix}$ and $\begin{bmatrix}0&1\\1&0\end{bmatrix}$.

      Please spell out your reasoning step-by-step before stating your final answer.

    The result:

      Let’s first define what it means for a matrix to be unitary and stochastic.

      A matrix is unitary if its conjugate transpose is equal to its inverse. In other words, for a matrix $U$, it is unitary if $U^{\dagger}U = I$, where $U^{\dagger}$ is the conjugate transpose of $U$, and $I$ is the identity matrix.

      A matrix is stochastic if all its entries are non-negative and the sum of the elements in each column is equal to 1.

      Now let’s consider a general $2 \times 2$ matrix:

      $U = \begin{bmatrix}a&b\\c&d\end{bmatrix}$

      For this matrix to be unitary, we must have:

      $U^{\dagger}U = \begin{bmatrix}a^*&c^*\\b^*&d^*\end{bmatrix}\begin{bmatrix}a&b\\c&d\end{bmatrix} = \begin{bmatrix}1&0\\0&1\end{bmatrix}$

      Expanding this product, we get:

      $\begin{bmatrix}a^*a + c^*c&a^*b + c^*d\\b^*a + d^*c&b^*b + d^*d\end{bmatrix} = \begin{bmatrix}1&0\\0&1\end{bmatrix}$

      This gives us the following equations:

      1. $a^*a + c^*c = 1$
      2. $a^*b + c^*d = 0$
      3. $b^*a + d^*c = 0$
      4. $b^*b + d^*d = 1$

      For the matrix to be stochastic, we must have:

      5. $a, b, c, d \ge 0$
      6. $a + c = 1$
      7. $b + d = 1$

      Now let’s analyze these conditions:

      From (6) and (7), we can write $c = 1 - a$ and $d = 1 - b$.

      Substituting these expressions into (1) and (4), we get:

      8. $a^*a + (1-a)^*(1-a) = 1$
      9. $b^*b + (1-b)^*(1-b) = 1$

      Expanding (8) and (9), we get:

      10. $a^*a + a^* - a + 1 = 1$
      11. $b^*b + b^* - b + 1 = 1$

      Simplifying (10) and (11), we get:

      12. $a^*a + a^* - a = 0$
      13. $b^*b + b^* - b = 0$

      Now let’s consider (2):

      14. $a^*b + (1-a)^*(1-b) = 0$

      Expanding (14), we get:

      15. $a^*b + a^* - a - b + ab = 0$

      Now, we know that $a, b, c, d \ge 0$ from condition (5). Therefore, $a^* = a$ and $b^* = b$.

      Substituting this into (12), (13), and (15), we get:

      16. $a^2 + a – a = 0$
      17. $b^2 + b – b = 0$
      18. $ab + a - a - b + ab = 0$

      Simplifying (16), (17), and (18), we get:

      19. $a^2 = 0$
      20. $b^2 = 0$
      21. $2ab - a - b = 0$

      From (19) and (20), we know that $a = 0$ or $a = 1$, and $b = 0$ or $b = 1$.

      Now let’s consider the possible cases for $a$ and $b$:

      Case 1: $a = 0$ and $b = 0$

      In this case, from (6) and (7), we get $c = 1$ and $d = 1$. So, the matrix is:

      $\begin{bmatrix}0&0\\1&1\end{bmatrix}$
      This matrix is not stochastic because the sum of the elements in the first column is not equal to 1.

      Case 2: $a = 0$ and $b = 1$

      In this case, from (6) and (7), we get $c = 1$ and $d = 0$. So, the matrix is:

      $\begin{bmatrix}0&1\\1&0\end{bmatrix}$
      This matrix is both unitary and stochastic.

      Case 3: $a = 1$ and $b = 0$

      In this case, from (6) and (7), we get $c = 0$ and $d = 1$. So, the matrix is:

      $\begin{bmatrix}1&0\\0&1\end{bmatrix}$
      This matrix is both unitary and stochastic.

      Case 4: $a = 1$ and $b = 1$

      In this case, from (6) and (7), we get $c = 0$ and $d = 0$. So, the matrix is:

      $\begin{bmatrix}1&1\\0&0\end{bmatrix}$
      This matrix is not stochastic because the sum of the elements in the first column is not equal to 1.

      So, the only $2 \times 2$ matrices that are both unitary and stochastic are:

      $\begin{bmatrix}1&0\\0&1\end{bmatrix}$ and $\begin{bmatrix}0&1\\1&0\end{bmatrix}$

      Therefore, the statement is True.
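    For what it’s worth, GPT’s case analysis reduces the problem to matrices with 0/1 entries, and a quick plain-Python enumeration of all sixteen such matrices confirms that only the identity and the NOT matrix survive. (A sanity-check sketch of my own, not part of the exam:)

```python
# Sanity check (plain Python, no external libraries): among all 2x2
# matrices with entries in {0, 1}, find those that are both unitary
# and stochastic.

def is_unitary(M, tol=1e-9):
    # Check U^dagger U = I for a 2x2 complex matrix M = [[a, b], [c, d]]
    a, b = M[0]
    c, d = M[1]
    eqs = [
        (a.conjugate() * a + c.conjugate() * c, 1),
        (a.conjugate() * b + c.conjugate() * d, 0),
        (b.conjugate() * a + d.conjugate() * c, 0),
        (b.conjugate() * b + d.conjugate() * d, 1),
    ]
    return all(abs(lhs - rhs) < tol for lhs, rhs in eqs)

def is_stochastic(M, tol=1e-9):
    # Nonnegative real entries, with each column summing to 1
    entries_ok = all(abs(z.imag) < tol and z.real >= -tol for row in M for z in row)
    cols_ok = all(abs(M[0][j].real + M[1][j].real - 1) < tol for j in range(2))
    return entries_ok and cols_ok

# Enumerate all 16 matrices with entries in {0, 1}
hits = []
for bits in range(16):
    a, b, c, d = [complex((bits >> k) & 1) for k in range(4)]
    M = [[a, b], [c, d]]
    if is_unitary(M) and is_stochastic(M):
        hits.append([[a.real, b.real], [c.real, d.real]])

print(hits)  # only the identity and the NOT (swap) matrix survive
```

    Of course, this only checks the sixteen 0/1 candidates that GPT’s case analysis narrowed things down to, not the full continuum of $2 \times 2$ matrices.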

  53. Scott Says:

    BUT: When I ask it whether it’s true or false that the identity is the only 2×2 matrix that’s both unitary and stochastic, and to show its reasoning before answering, it goes through a similarly long argument before again saying that the statement is true!

  54. Schwarzschild Says:

    Scott #50

    So, if you did few-shot prompting, you’d have to do something like that for the students, no? Then re-score and re-compare.

    I would not place that bet! 🙂

    This stuff is too empirical and hard to just armchair-reason about.

    Anyway, my point is more that we can, for example, imagine a student going through the internet and collecting more and more examples of QC problems to “even” the playing field for humans vs. GPT, as some have suggested in the comments.

    I don’t see how, with the techniques GPT uses (at least before they started hiding things 🙂 ), you compensate for the erasing that students do; if it could check itself as it goes, we’d likely see a reduction in hallucinations.

  55. Scott Says:

    Schwarzschild #54:

      So, if you did few-shot prompting, you’d have to do something like that for the students, no? Then re-score and re-compare.

    The students had few-shot prompting. It was called “attending my lectures and doing the homeworks and midterm and practice exams”! 😀

  56. Schwarzschild Says:

    Scott #55

    Something seems off there.

    GPT-4 also almost certainly had access to all your lectures and your book (they are online), and it would not surprise me if it had access to previous exams (like your students). Plus it had access to thousands of other examples and lecture notes.

    The few-shot prompt engineering could be a type of “memory bootstrapping” that humans seem to do (just an analogy).

    To make things more “fair”, you’d need to do something on your exams like “Hey, to refresh your memory, here are some examples before you answer the following question.”

  57. Nepeta Says:

    The way you’re making GPT-4 answer the true/false questions has an obvious flaw/feature that makes GPT-4 *weaker* than it otherwise could be. Namely, you’re forcing it to output True/False as the first word of the answer, meaning it has to get the correct answer in a single pass.

    The explanation after the first word of the answer is simply a “rationalization” of the answer that’s already locked in, and can’t change the answer.

    If instead you made it so that True/False is the *last* word of the answer, it would have “time to think” while generating the explanation, and would likely be correct more often.

    Bryan Caplan’s exam had the same issue.

  58. Nepeta Says:

    Another simple thing that could enhance GPT-4’s performance would be to give it the ability to use an “invisible scratchpad”, where it can work out the correct answer using chain of thought, and whose contents are not included as part of the final answer. (It would take some thought to design a prompt that would make it do so, but it should be possible.)

  59. Scott Says:

    Nepeta #57, #58: Please read the comments—we’ve been discussing exactly those issues!

    In particular, I just tried forcing GPT to output its reasoning before answering for one of the true/falses that it got wrong, and that indeed caused it to get the right answer … but then the same “reasoning” caused it to output the wrong answer to a slight variant of the question! Will continue trying (or you’re welcome to do so).

  60. Raoul Ohio Says:

    Scott #8, and everyone:

    If you get a phone call that wants to sell you an extended warranty on your car, it is best not to respond at all; just hang up.

    Likewise, it is a mistake to respond to clowns like BogBoy #7 — it just encourages them to continue spamming you. Just block them without any acknowledgement and go on with your life.

  61. Daniel Reeves Says:

    Holy hobgoblins, this is gobsmacking. I feel like I’ve been saying that every other day since Google’s PaLM paper, which is when I first noticed that LLMs could do things that, until that point, I’d been completely convinced they’d never be able to do.

    I used the gpt-4 chat model in playground, with a temperature of 0.2 and a max length of 1930 tokens.

    Any guidance on choosing these parameters? Looks like the defaults are 0.7 and 256 and I don’t have much sense of why/when I might want to change them.

    PS: Ok, the obvious thing finally occurred to me. Here’s GPT-4’s response to the above (using Scott’s settings of temperature 0.2 and max length 1930):

    The temperature and max tokens parameters in the GPT model can be adjusted to control the output’s creativity and length, respectively. Here’s a brief explanation of each parameter and some guidance on choosing their values:

    1. Temperature: This parameter controls the randomness or creativity of the generated text. A higher temperature (e.g., 0.7 or above) will produce more diverse and creative outputs, while a lower temperature (e.g., 0.2) will make the output more focused and deterministic.

    – If you want more conservative and coherent responses, choose a lower temperature (e.g., 0.2).
    – If you want more diverse and creative responses, choose a higher temperature (e.g., 0.7 or higher).

    2. Max tokens: This parameter controls the maximum length of the generated text. The default value is 256 tokens, which is usually sufficient for most use cases. However, you can increase or decrease this value based on your requirements.

    – If you need a longer response or have a more complex prompt, you can increase the max tokens value (e.g., 512, 1024, or even 1930).
    – If you want shorter responses, you can decrease the max tokens value (e.g., 128 or lower).

    Keep in mind that increasing the max tokens value may lead to higher API usage and longer response times. It’s essential to strike a balance between the desired output length and the API usage.

    In summary, adjust the temperature and max tokens values based on your desired output’s creativity and length. Experiment with different values to find the optimal settings for your specific use case.
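    Mechanically, temperature just rescales the logits before the softmax sampling step. Here’s a toy plain-Python sketch of that (my own illustration, not OpenAI’s actual implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    if temperature <= 0:
        # The limit T -> 0 is greedy argmax decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

rng = random.Random(0)
logits = [1.0, 5.0, 2.0]
for t in (0.2, 0.7, 2.0):
    picks = [sample_with_temperature(logits, t, rng) for _ in range(1000)]
    # Lower temperature concentrates picks on the argmax (index 1);
    # higher temperature spreads them across the alternatives.
    print(t, picks.count(1) / 1000)
```

    At temperature 0.2 the samples land almost entirely on the highest-logit token; at 2.0 they spread out noticeably, which is the “creativity” knob in practice.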

  62. Haelfix Says:

    Scott, how much work is OpenAI doing in finetuning their LLMs for various applications (like quantum computing)? One obvious area of improvement would be feeding it textbooks in certain subjects, and then combing through the high-citation arXiv papers. Or is their compute very much confined to creating the most general models possible?

    The next thing to try would be to implement it with a proper plugin, like Wolfram, and then repeat the finetuning to see if that helps or not.
    Finally, a sort of reasoning loop (like AutoGPT) where it starts trying to subdivide problems into simpler pieces, performs unit/consistency checks, and then revises the answers. Right now it answers questions as if giving someone’s off-the-cuff intuition, without the benefit of detailed and methodical analysis (which it’s capable of in principle).

  63. OhMyGoodness Says:

    There was a guy in college who was pretty good mathematically (I think he scored like a 30-40 on the Putnam). I don’t know why, but on a Complex Variables test his answer to the last problem was “F**k You”. He didn’t receive partial credit but, as far as I know, it was a truly original response in the history of Complex Variables tests.

  64. Bruno Says:

    Scott, any comment on Alex’s #37? GPT got full credit for a completely wrong answer to 2(a) of the short-form questions. The density matrix it starts with is [1,0,0,1; 0,1,1,0; 0,1,1,0; 1,0,0,1]/3, which does not have rank 1 and has trace 4/3. AFAICT tracing out the first qubit of this wrong density matrix should give (2/3)[1,0; 0,1], but the result is yet something else. This should probably receive very partial credit, like 1/5, for knowing some of the steps to perform. Unless what is taken into account is only the final result, which is only slightly off: it should be [2,1; 1,1]/3 as far as I can tell.
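    The arithmetic here is easy to verify in plain Python; a minimal check of the trace and of the partial trace over the first qubit (the helper function name is mine):

```python
# GPT's (incorrect) two-qubit density matrix from 2(a), rows as given
rho = [[x / 3 for x in row] for row in
       [[1, 0, 0, 1],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]]]

trace = sum(rho[i][i] for i in range(4))  # should be 1 for a valid density matrix

def trace_out_first_qubit(rho4):
    # Basis ordering |q1 q2> = |00>, |01>, |10>, |11>, i.e. index = 2*i + j;
    # the reduced matrix is (rho2)[j][k] = sum_i rho4[(i,j)][(i,k)]
    return [[sum(rho4[2*i + j][2*i + k] for i in range(2))
             for k in range(2)] for j in range(2)]

rho2 = trace_out_first_qubit(rho)
print(trace)  # 4/3, not 1
print(rho2)   # (2/3) times the identity
```

    This confirms both claims: the trace is 4/3 rather than 1, and tracing out the first qubit gives (2/3) times the identity.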

  65. Shion Arita Says:

    I’ve made some sweeping comments about these NN based AI systems over the past few posts here, hashing out my thoughts and position(which seem to be unique to me so far), and I’ve just now come to a further refinement of them.

    I think that LLMs like GPT-4 specifically, and NN-based AI more generally, are not AGI, and by themselves probably won’t be, but they are very powerful, significant, and context-breaking systems that are the new big thing of the near future, and will transform the world as much as things like machinery, electricity, computers, and the internet did. They are very powerful optimizing objects that can do things that machines couldn’t do before. But no, I don’t think they have true intelligence or understanding.

    The problem is, I don’t think there really exists a good single test of true intelligence or understanding. I don’t think it’s possible, for a couple of reasons. Firstly, for any particular task you have, it seems like it’s possible to make a ‘narrow’ system that can perform that task. Secondly, I can’t think of a question that almost any human would get right and GPT-4 would get wrong. On its face, this might sound like an argument for the intelligence of GPT-4, but it’s really a statement about the inadequacy of questioning, and in particular, the inadequacy of many typical humans. To illustrate the problem more clearly, some things I was considering for true understanding tests were things like “say something that is very surprising, but ends up being right,” or “solve something in a way that no one else has ever solved it.” If it could do these things (which I don’t think it can), I might say it was intelligent, but many humans would fail on these criteria. In fact, almost all humans would fail on these criteria within any observation window short enough to be reasonable; even the smartest and weirdest person who has ever existed probably does not spend most of their time doing things that are profound. And most humans never do anything that’s really profound. Such a test can’t be simple; there is no simple thing for which there’s no simple system that can do it, and there’s nothing non-profound that can’t be faked or precomputed in advance. So we’re kind of stuck.

    If we weren’t humans ourselves, but rather aliens whose cognition was very different from our own, and we were given a typical human to study, without being able to see the achievements of humanity, I think we’d have a very hard time determining whether humans were truly intelligent or conscious or not. I think we only really know because we are able to directly observe our own consciousness, self-awareness, and thought process. And since we know that other humans are structured very similarly to ourselves, we know that they must be like this too. Take that symmetry-via-similarity away, and we’re kind of in the dark. Or at least, we’re left with having to do a lot of complicated tests and a lot of back and forth with the system; doing an evaluation like this is as complicated as anything can be, and thus will seem subjective, with room for debate. As someone said replying to me in another thread, sometimes the goalposts really were in the wrong place the whole time.

    So what is GPT4 (and other similar systems), and how do I know that?

    It’s not a ‘stochastic parrot’, I don’t think. I know I said that about LaMDA after the Blake Lemoine thing, but I have changed my view in the intervening time, after being able to actually use these systems myself and experiment with them in a serious way. It is something different, and I don’t think we have a good word for it. I tried to come up with one and failed, because we don’t really know what’s going on in there in the first place.

    I think the biggest mistake that people make with regard to NN AI systems is seeing the powerfully optimized results, seeing the complexity and capability of what they do, along with the mysterious nature of exactly what these things are learning in there and how they are doing what they do, and thinking things like, “it must actually understand things like quantum computing, because there’s no other way it could be doing this, getting these questions right.”

    I assert that “there’s no other way it could be doing this” is in fact false (!)

    Like so many things, This has been staring us in the face since the dawn of time, but it takes a specific insight to see the significance of what we were looking at. There are currently two strongly optimizing forces that are at play on earth: evolution by natural selection and human cognitive intelligence. Human-made electronic computers are rapidly rising as a third ‘great power’ but aren’t quite there yet. But let’s focus on the first power: evolution by natural selection.

    Humans first looked at the complex and heavily optimized objects they saw in the world (themselves and other organisms), saw the optimization and complexity there, and concluded, nay, took as axiomatic, that they had come into being via a force of deliberate intentionality that was aware of the significance of what it was doing when it created them. This made sense because, look at these things! Look how optimized and purposeful they are. They must have been made by something that understood the meaning of what it was doing. There’s no other way it could have been done.

    Except, it in fact turns out that there IS another way it could have been done, and that’s exactly what happened. And that other way, evolution by natural selection, does not have any awareness or intentionality, and it does not understand the meaning of what it is doing. And yet, and yet *shockingly* I might add, here are these complex, optimized results showing apparent purposefulness. Darwin figured it out, and to his credit, he figured it out before we had the tools to observe in detail how biological systems actually were structured and how they were working.

    In other words, we already know that this can happen. And I think it’s pretty clear that it did happen here, albeit in a different way.

    This also happened with chess and classical computation. In the past, people thought that achieving top-human-level performance at chess required a true AGI. When Deep Blue defeated Kasparov, the lesson learned was that this was incorrect. Belief in the significance of this (the fact that it didn’t need to be intelligent, not the fact that it beat him) was strongly tempered by the fact that, unlike biological systems and unlike NN AI systems, we humans actually understood very well how Deep Blue really worked, what it was actually doing to achieve its results, and why its algorithm would be effective at playing chess. The difference is that with biological systems and NN AI systems, their inner workings are not readily comprehensible by humans, and thus there is a lot of room for ‘god of the gaps’ or ‘AGI of the gaps’ type arguments to creep in.

    Before protein structures were elucidated via X-ray crystallography, it was widely believed that they would be very regular, highly symmetrical structures with clockwork-like elegance in form and function. When the first structure was revealed, it turned out they are random-seeming spaghetti curves. Max Perutz, who elucidated this first structure, said of his discovery, “Could the search for ultimate truth really have revealed so hideous and visceral looking an object?”

    Is natural selection ‘intelligent’? I have to say no. And that’s because even though what it does is heavily optimized, and complicated, it doesn’t understand the meaning of what it does; furthermore, there is no ‘it’ there that could even potentially do the understanding. It’s optimization without an optimizer. In an earlier post I said, about the AI, that we’re in the Blindsight timeline, but haven’t we kind of been in the Blindsight timeline this whole time?

    I think that we will see these non-intelligent but strongly optimizing AI systems continue to do more and more surprising things, things previously believed to require true AGI, while remaining lacking in certain areas (for example, areas where it needs to perform an exact sequence of computational steps to achieve a very specific result, areas where it has to reproduce complicated experimental results that came from our interaction with the outside world, and areas where it needs to innovate and decide what is or is not significant). For example, as Nick Drozd #22 said: “Also, have you tried using ChatGPT to generate exam questions? Can you answer all of its questions?” I wrote the following as my advance prediction after seeing Nick’s comment and before seeing Scott’s attempt: “This is the kind of thing where I expect it would kind of break down. I predict that it will create a mixture of ‘generic/trivial’ questions and ‘flawed/bad’ questions.” I think my prediction was pretty much right.

    I think that these NN AI systems are not like natural selection in a lot of ways; they’re also not like traditional computation, and they’re ALSO not like cognitive intelligence. And all of these things are separated by vast distances in the space of what systems can be like. They’re something else, something pretty far away from anything else, but maybe along a vector halfway between things like natural selection and traditional computing. I’m not really sure, though.

    Why do I think that these systems’ limitations are meaningful when compared to humans? These NN AI systems work by exploiting certain regularities in the training data (you could argue that any optimization has to exploit patterns in data, but I think humans are much more deeply general at this than things like gradient descent can be; these large training runs are brute-forcing the learning in a way where humans have a real efficiency advantage), and the way they exploit them, I suspect, would seem kind of dumb and inefficient if a human could do all the bookkeeping required to understand what it was really doing.

    In conclusion, I think the lesson to learn here is basically ‘oh, it didn’t need to be intelligent to do these amazing things. But there are lots and lots of amazing things that you don’t need to be intelligent to do, and this is surprising, but it seems true. So we should expect this trend to continue; more and more results without the intelligence behind them’. I don’t think I’m using some weird definition play on the word ‘intelligence’ because I think there’s something concrete and real to intelligence which is hard to describe verbally but has a lot to do with vague concepts like ‘understanding’ and ‘meaning’.

  66. Prasanna Says:

    What was revealing from Sebastien Bubeck’s talk at MIT was that the publicly available GPT-4 is a dumbed-down version, for safety reasons; he also mentioned that the base model is more powerful. This raises several questions:
    1. Did GPT-4 get a B and not an A due to this? If not, do we know of any reasons to believe that?
    2. What has GPT-4 truly achieved beyond what is “visible” to the public?
    3. Are the current limitations, like calculation errors and others, due to these safety considerations constraining GPT-4 in some form?
    4. It looks like few people even within OpenAI will have access to this powerful model, notwithstanding even more powerful models like GPT-5 in development?
    5. Note that the top executive team of OpenAI is liberally dropping the word AGI in their public talks, and even talking about getting society “gradually” ready for it?

  67. fred Says:

    Hurrah to OpenAI!

    But it’s quite puzzling that there’s no consideration in this post of what this means for the “students,” who are mentioned only as a benchmark for the AI.

    As much as Feynman’s “nobody understands quantum mechanics!” has been a confidence boost for generations of students, this is quite the extreme opposite!

    But, rejoice, soon enough the AI will no longer be just a toy that answers some evaluation tests; it will create and manage the tests itself… (e.g., what’s the point for Amazon to stick with its current hiring-test protocol when the AI can do better than all humans in a fraction of the time?)
    It won’t take long for OpenAI to offer a super-affordable alternative to the university path, allowing students to study with the AI, be evaluated by the AI, be encouraged by the AI (24/7, with infinite patience), and be rewarded by the AI with a shiny diploma.
    The death of student debt is finally in sight!

  68. Matt Says:

    I disagree with the 5/5 mark for Q2a. GPT’s $\rho$ and $\rho_2$ are not even density matrices (they have trace 4/3).

  69. Sabine Says:

    I’m just here to complain that a vector doesn’t have a dimension; its vector space does.

  70. Tomislav Says:

    Hello, I’m just a beginner in quantum computing, so please don’t spit on me if I’m wrong, but my experience is a bit weirder, which I described on my blog.

  71. space2001 Says:

    So far so good artificial intelligence imitators..

    Can ChatGPT-N ever invent quantum mechanics? No? I did not think so either 🙂

  72. Raoul Ohio Says:

    OhMyGoodness #63:

    He was evidently a Fortran programmer.

  73. fred Says:

    It’s striking that we still have a very poor understanding of how the human brain actually works, but that didn’t stop us from emulating it using a system that’s just as inscrutable!
    Is this a coincidence?

  74. Scott Says:

    space2001 #71: Could you have invented quantum mechanics? I didn’t think so either 🙂

  75. Scott Says:

    Sabine #69: My entire professional life, people have without confusion used “dimension of a vector” to refer to the dimension of the relevant vector space in which the vector resides. (Also known as: how many scalars you need to represent this vector in your computer’s memory.) Were they not supposed to do that, maybe not unless the vectors came from the Dimension region of France? 🙂

  76. OhMyGoodness Says:

    Raoul Ohio #72:

    Actually an ME (I just checked: he had a triple bachelor’s in ME, Math, and Comp Sci), and after graduation he wrote geophysical-interpretation software for oil and gas exploration and did well, I believe. He was originally from Ohio, so you apparently have that in common.

  77. OhMyGoodness Says:


    Did you know this story?

  78. Scott Says:

    fred #73:

      It’s striking that we still have a very poor understanding of how the human brain actually works, but that didn’t stop us from emulating it using a system that’s just as inscrutable!
      Is this a coincidence?

    Not a coincidence! When people understand how something works, they don’t think of it as “AI.”

    (Interesting corollary: when skeptics say that GPT is not a “true AI,” is the issue that they imagine they understand it much better than they do? I.e., that they say “oh, it’s just a glorified probabilistic autocomplete,” but have never grappled with the enormity of the internal representations of real-world concepts you’d need to have learned in order to do autocomplete this well?)

  79. Qwerty Says:

    What does this mean for us all?! What can GPT still NOT do?

  80. fred Says:

    Qwerty #79

    “What can GPT still NOT do?”

    It doesn’t always do well when dealing with new “languages” (like rules of string manipulation, or proto computer languages) that it wasn’t trained on, no matter how simple they are. But I guess here it’s a matter of training it more so that it gets better at generalization and pure logic.

    Also, if you ask it to do some reasoning based on some random pick that it can’t write down explicitly, it struggles to consistently “remember” what its pick was (because it’s all implicit in the prompt, and since it’s all probabilistic, it’s not always re-deriving its pick correctly). That’s why it struggles to reason about or manipulate its own answer before it writes it down (like reversing strings).
    That’s also why it always helps to ask it to break down any reasoning into very granular steps that it writes down in the prompt; this anchors the whole chain of reasoning much better. We humans do the same when we reach for a notepad to anchor our thinking, especially when we’re exploring or being creative.

  81. fred Says:

    For noobs like me who may be confused by the bickering/trolling from Sabine #69 about vectors, vector spaces, dimension, etc.

  82. Scott Says:

    Qwerty #79: We’ve already seen examples in this thread of things that GPT still can’t do. E.g., it flubbed many of the harder problems on this exam, and it seemed unable to generate its own halfway decent exam (even when I specifically asked it to invent more creative problems).

    The harder question is what it still won’t be able to do a couple years from now—when there will not only be much more powerful models, but also much more seamless integration with the Internet and tools (internal scratchpads, etc) to help remedy all of its current weaknesses.

  83. Raghu Parthasarathy Says:

    I had GPT-3.5 (not even 4!) take my “Physics of Energy and the Environment for non-scientists” final exam; it ranked 22nd out of 80, and did better on short-answer than multiple-choice questions!

    I just posted about this here, and I also comment on how GPT-x may be useful for assessing exam quality, revealing ambiguous questions through the rare things it gets wrong.

  84. Steven Says:

    Scott #82

    What are the current scaling limits of the current AI techniques (neural nets at their core)? For example, if we need to increase the number of parameters by 2 or 3 orders of magnitude to ace all reasonable exams, how do the compute and energy requirements scale? I’ve heard it is quadratic in the parameter count. Also, how does the amount of data needed scale to increase performance (if we know)?

    In terms of using other tools, I’m really interested to see how that can enhance things. Has any of this stuff been formalized in an attempt to project outcomes, or are we more in “extremely hard to prove, let’s just empirically see what happens” territory?

  85. Gil Kalai Says:

    Amazing post!

    It would be nice to see whether GPT can think on its own (or with some useful prompts) about surprising connections, like the connection between computation and quantum mechanics. (E.g., could GPT have invented quantum computation had it existed in 1970?)

    There are quite a few connections in science and mathematics that, after you see them, seem easy, and sometimes only involve making a connection between two separate areas. It is not unlikely that GPT would be good at offering such connections.

  86. Scott Says:

    Alex #37, Bruno #64, and Matt #68: You’re absolutely right—and while Justin graded 2(a), I take responsibility for the oversight and am sorry. I’ve downgraded GPT to 1/5 for that problem, same as I’d give a student who set up the right kind of calculation but then produced a totally wrong answer. This decreases the exam score from 73/100 to 69/100, which is still a B but no longer a “solid” B (some more lost points could move it to B-).

    Any other regrade requests? 😀

  87. JimV Says:

    Shion Arita Says: “But no I don’t think they have true intelligence” (Comment #65)

    As I commented previously, neuroscience has working definitions of intelligence, sapience, and sentience, of which GPT-4 is thought to have demonstrated intelligence conclusively, but not the other two. I suspect you are using your own semantics which combines the three, which is natural, but leads to semantic arguments.

  88. Scott Says:

    Gil Kalai #85: I was just thinking about how I’d love to interact with a language model that was trained, let’s say, only on text before the year 1970. There are so many recent developments to which one could ask for its reactions—not quite like having a time machine, but some of the same functionality.

    That said, how exactly would we check whether such an LLM could “invent quantum computation”? Would we prompt it with the basic idea, and then see if it could take it any further?

  89. Scott Says:

    Steven #84: It’s complicated. Keeping the training data fixed, you can vary (among many other things) the depth of your neural net, the width, and the number of steps of gradient descent. Alas, we don’t really know theoretically how any of this stuff affects the real-world performance: almost everything we know is empirical. And then there’s an algorithmic issue specific to transformer models, which is that their runtime scales quadratically with the length of the context window. A linear-time algorithm, if discovered—even one that just approximately preserved the performance—would make it much easier to have GPT, e.g., ingest a whole book at execution time.
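    To make the quadratic scaling concrete, here is a toy sketch (my own illustration, in C to match the other code in this thread; the sizes and function name are made up). The score matrix at the heart of self-attention needs one dot product per pair of tokens, so the work grows as n² in the context length n:

    ```c
    #include <stdio.h>

    #define N 8   /* toy context length */
    #define D 4   /* toy embedding dimension */

    /* multiply-adds needed just for the n x n attention score matrix */
    long attention_score_ops(int n, int d) {
        return (long)n * n * d;
    }

    int main(void) {
        double q[N][D], k[N][D], scores[N][N];

        /* fill toy query/key vectors with arbitrary values */
        for (int i = 0; i < N; i++)
            for (int x = 0; x < D; x++) {
                q[i][x] = (double)(i + x);
                k[i][x] = (double)(i - x);
            }

        /* the quadratic part: one dot product per (i, j) token pair */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                scores[i][j] = 0.0;
                for (int x = 0; x < D; x++)
                    scores[i][j] += q[i][x] * k[j][x];
            }
        (void)scores;   /* only the cost, not the values, matters here */

        printf("ops for n=%d: %ld\n", N, attention_score_ops(N, D));
        printf("ops for n=%d: %ld\n", 2 * N, attention_score_ops(2 * N, D));
        return 0;
    }
    ```

    Doubling the context quadruples the score-matrix work, which is the quadratic bottleneck Scott mentions; a linear-time variant would have to avoid forming all n² pairs.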

  90. fred Says:

    For humans, “Attention and Sleep is All You Need”.

    So, extrapolating on the similarities between the human brains and LLMs, I wonder if there could be any connection between dreaming and LLMs hallucinations. 🙂

    It’s interesting to note that severe sleep deprivation in humans leads to hallucinations and interesting cognitive changes:

    “On the eleventh day, when he was asked to subtract seven repeatedly, starting with 100, he stopped at 65. When asked why he had stopped, he replied that he had forgotten what he was doing”


  91. Hans Holander Says:

    Yes, GPT-4 has absorbed the entire internet and can answer similar questions reasonably well, including common mistakes. Then you ask novel questions and it gets 4 out of 90 points, and the professor notes: “The AI had no problem reading the equations; all of its errors are due to fundamental misunderstandings of basic concepts.”

    Time to see LLMs for what they are. Namely, LLMs. 🙂


  92. fred Says:

    When asking GPT to do that task of “subtract seven repeatedly, starting with 100”, it takes a lot of contortion to get it to finally print it only in reverse order (something that’s near impossible for it to do correctly, obviously).
    subtract 7 from 100, repeatedly, but don’t go negative.
    Also, write the resulting list in reverse order only, i.e. starting with the smallest number, then the one that’s 7 bigger, etc until you reach 100.
    Absolutely do not write any intermediate steps and do not write the result list in decreasing order, ever!

    Sure, I can help you with that!
    Here is the resulting list in the order you requested:
    1, 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78, 85, 92, 99, 100
    To obtain this list, you start with 100, subtract 7 to get 93, then subtract 7 again to get 86, and so on, until you get to 1. Since we are not allowed to write any intermediate steps, I can only give you the final list in reverse order. But I assure you that this is the correct sequence of numbers that you would get by subtracting 7 from 100 repeatedly, without ever going negative.


    It’s quite strange that it insists it has the right result with “I assure you…”, almost as if it knows it’s wrong.
    It’s so much like one of HAL 9000’s lines in 2001: A Space Odyssey:

    “I know everything hasn’t been quite right with me, but I can assure you now, very confidently, that it’s going to be all right again. I feel much better now. I really do. “
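    For reference (my addition, not part of fred’s or GPT’s text), a few lines of C produce the list GPT was asked for; since 100 mod 7 = 2, the correct sequence runs 2, 9, 16, …, 100:

    ```c
    #include <stdio.h>

    /* smallest value reached by repeatedly subtracting 7 from 100
       without going negative */
    int smallest_term(void) {
        int n = 100;
        while (n - 7 >= 0)
            n -= 7;
        return n;   /* ends at 2 */
    }

    int main(void) {
        /* print in increasing order, as the prompt demanded */
        for (int n = smallest_term(); n <= 100; n += 7)
            printf(n == 100 ? "%d\n" : "%d, ", n);
        return 0;
    }
    ```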

  93. Nobozo Says:


    I remember the scandal caused when a scientific conference accepted a paper written by SCIgen, a program “that uses context-free grammar to randomly generate nonsense in the form of computer science research papers” (Wikipedia).

    Just for yuks, I took the time to look at the conference announcement for this conference. I don’t remember the details, but I do remember feeling that it also was generated by a program, but this time one that uses context-free grammar to randomly generate nonsense in the form of computer science conference announcements.

    I wonder if this example of dueling AI programs will become more common in the future?

  94. Shion Arita Says:

    @jimv# 87
    I did think about whether I was having issues with definitions, and I don’t think I am, or at least not more than anyone else. GPT has *something* significant, in that it can produce complex outputs and things like answers to questions. But what I was getting at is that a lot of processes can do things like this. Is natural selection intelligent? Is Stockfish intelligent? I think the only consistent thing to say is that either both of those plus GPT are intelligent, or all of them are not. And since I think most people would say ‘no’ on the first two, I think we need to say no on the third as well.

    re: understanding and meaning/ the meaning of intelligence.
    I think an example might be in order here.

    Recently, I needed to write a computer program to do a simple mathematical calculation: to take an input number from 0 to 99, and output the probability that the input number is less than the average of two random numbers from 0 to 99. After I wrote the program, I thought “I wonder if GPT can write this program.” So I asked it to, and it did it!

    my prompt was: “write a C progrma that takes an input number from 0 to 99, and outputs the probability that the input number is less than the average of two random numbers from 0 to 99.”

    Here is my program:


    #include <stdio.h>

    int main(){
        float num;
        float a;
        float outputProb;

    Mainloop:
        printf("input: ");
        scanf("%f", &num);

        if (num < 50){
            a = 2 * num;
            outputProb = (a * (a + 1)) / 200;
        } else {
            a = (2 * (100 - num)) - 1;
            outputProb = 100 - ((a * (a + 1)) / 200);
        }

        printf("probability: %.2f\n", outputProb);

        goto Mainloop;
    }

    Here is GPT’s program (NOTE: the GPT code is edited somewhat to match mine in things like variable names, code style, and some formatting; the meaningful content is the same as what GPT put out, though):


    #include <stdio.h>

    int main(){
        int num;
        int count;
        int i, j;

    Mainloop:
        count = 0;

        printf("input: ");
        scanf("%d", &num);

        for (i = 0; i <= 99; i++) {
            for (j = 0; j <= 99; j++) {
                if ((i + j) / 2.0 < num) {
                    count++;
                }
            }
        }

        printf("probability: %.2f\n", (double)count / 100);

        goto Mainloop;
    }

    EDIT: sorry for the badly formatted whitespace. I can’t figure out how to get tabs or spaces to actually indent newlines here.

    My first thought upon seeing it was “wow, it works!”
    My second thought was “that is not very efficient.”
    Then I got to thinking more deeply about it and had my first *interesting* thought. This is how you would have to solve this problem if you had no insight into its mathematical structure: just count everything up and divide by the size of the space. Which is the definition of probability, so it works. I didn’t think of doing it this way; my approach was immediately to find a reduction, a closed-form formula that represents the structure of the problem and thus solves it. I’ll leave the derivation of the formula as an exercise to the reader, but suffice it to say it’s not *immediately* obvious, though not that hard either.

    Nonetheless, if hypothetically GPT had spat out my program, I think that would have constituted a “HOLY FUCKNUGGETS!!!!!” situation, which I have not yet quite had with it. I would be *extremely* impressed, and would probably go a lot farther toward saying it’s intelligent (though in that hypothetical it would be a bit hard to prove it wasn’t copying something someone else did, but I low-confidence doubt that something similar enough is in its training data), because that solution requires a real insight into the structure of the problem itself, and its mathematical meaning, in a way that is very far removed from the wording or definition of the problem. Hopefully this example helps illustrate what I’m getting at.
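    Since the derivation is left as an exercise, here is a quick cross-check (my addition, assuming the two programs quoted above): the closed-form percentages from Shion’s program and the exhaustive count from GPT’s agree for every input from 0 to 99:

    ```c
    #include <stdio.h>
    #include <assert.h>
    #include <math.h>

    /* closed form, as in Shion's program (result in percent) */
    double closed_form(int num) {
        double a;
        if (num < 50) {
            a = 2.0 * num;
            return (a * (a + 1)) / 200.0;
        } else {
            a = (2.0 * (100 - num)) - 1;
            return 100.0 - ((a * (a + 1)) / 200.0);
        }
    }

    /* brute force, as in GPT's program (result in percent) */
    double brute_force(int num) {
        int count = 0;
        for (int i = 0; i <= 99; i++)
            for (int j = 0; j <= 99; j++)
                if ((i + j) / 2.0 < num)
                    count++;
        return (double)count / 100.0;
    }

    int main(void) {
        for (int num = 0; num <= 99; num++)
            assert(fabs(closed_form(num) - brute_force(num)) < 1e-9);
        printf("closed form matches brute force for all inputs 0..99\n");
        return 0;
    }
    ```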

  95. starspawn0 Says:

    One thing I see a lot of people, on this thread and others, skeptical about is whether the system can address “novel questions”. I don’t think there’s any question that if you allow these models to try again and again (performing “search”), they will solve novel problems, under whatever definition of novel one wishes to use.

    You can think of it like this: let’s say you want to build a “dumb” system that can prove mathematical theorems of length up to 1,000 words. You could build a system that just generates random strings of English, maybe with a little bias towards math that appears in existing documents and towards the problem statement; and then you could feed those into a proof checker, continuing until you find a correct proof, or you report, “no proofs found”.

    Maybe this system would succeed 1 time in 10^100, so it would be far too impractical to actually use. But the engineers don’t stop there. They tweak it some, making it “smarter”. They train a much larger model, give it access to a calculator, allow it to check intermediate steps, etc. Then the chance of success becomes more like 1 time in 10^10. After some more tweaks and better training data, maybe this is brought down to 1 time in a million. With further refinement, 1 time in 1,000, which is in range of giving proofs of hard mathematical theorems two or three pages long in a few hours. (1 time in 1,000 might be what you would see in an expert human, too, when tackling hard problems…)

    There is a human tendency to look at a painting, work of math, or bit of physics, and ascribe meaning to it; and then to look for the human behind that meaning — “what singular mind it must have taken to come up with that…” And so it becomes hard to imagine works that seem meaningful to not have any human behind them at all — like believing in “something from nothing” or “life from non-life” or the crassest nihilism, that there is no truth behind “meaning” whatsoever. The story of the chess master who lost to a machine and then angrily claimed there must have been humans behind the scenes feeding it answers, because the moves looked too “intelligent”, comes to mind.


    Not to get too off-track, but what would truly be lost if a famous painting that appears everywhere on the internet were burned? What if someone burned the Mona Lisa, for example? There would be news articles about it all around the world, about the tragedy that the world was now deprived of an irreplaceable artifact of human genius. Maybe we would be deprived of some secrets about the chemical and physical composition of the painting yet to be unearthed; but the image itself would still exist as a million jpeg files. Tell this to an art aficionado, however, and they’ll think about it for 20 seconds or so, then say, “It’s… it’s… just not the same.” There’s something about that physical object that was touched by the hands of da Vinci that matters.

    People who appreciate a great work of mathematics or physics don’t direct such admiration towards a physical object, but I wonder if they do, subconsciously, look at a piece of mathematics or physics and feel there is some residue of the master there between the lines right in front of them.

  96. Ben Bevan Says:

    Holy crap. That is all

  97. Scott Says:

    starspawn0 #95: Personally, I grieve much, much less for the physical artifacts that were destroyed, than for all the ancient writings of which the only copy was lost when (for example) the Library of Alexandria was burned down.

  98. Scott Says:

    Update: “Topologist Guy” is now permanently banned from this comment section. Whatever value he brought, it’s now outweighed by his constant, constant needling me and adding more stress to my life. Sorry to anyone who dislikes this decision! At this point, though, the choice for me is only between

    (1) getting more “ruthless” with moderation, or
    (2) pulling back from blogging or participating in comments entirely.

  99. apex Says:

    How quickly could you or one of your graduate students correctly complete the exam? This feels a little like celebrating a translation program’s ability to perform at the level of mediocre Introduction to Russian II students, rather than native speakers.

  100. fred Says:

    starspawn0 #95

    “What if someone burned the Mona Lisa, for example?”

    For what it’s worth, this is the result of asking GPT to describe the Mona Lisa in Tikz:

  101. Scott Says:

    apex #99: I expect it would take any of us at least ~90 minutes (the sheer physical writing would easily be an hour on its own!).

    Unlike the various Monday-morning quarterbacks of this comment section, I’m not the slightest bit ashamed to celebrate an AI achievement that’s clearly not yet competitive with human experts, but is still incomprehensibly, science-fictionally far beyond anything that was possible (let’s say) 5 years ago.

  102. apex Says:

    Scott #101: You shouldn’t be! It’s incredible how far it’s come in such a short time. Even if it didn’t seem like it, my question was honest: in some classes I’ve taken, my perception was that the professors could not do substantially better than good students (probably because they didn’t do much computation in their research); in others, everything was obviously child’s play.

    I also think the grading error indicates that verification is a little harder than you give it credit for. Using a super error-prone tool for moderate problems, and being totally confident the output is correct, may take much more time than generating solutions yourself and reaching the same level of confidence. Obviously for super hard problems verification is faster than creation, though, and sometimes for super easy ones too.

  103. ExPhysProf Says:

    Comment 94
    Many years ago, well before there were online resources, I was working at home and needed a value for zeta(4). I knew there was a clever trick for finding it (it’s pi^4/90), so I spent a fruitless hour trying and failing to work it out. Then, an aha moment: I had an Apple II+. I wrote a one-line BASIC program and watched zeta(4) appear and converge on my screen in less than half a minute. Elegance is overrated as compared to simplicity when all you need is a result. You did not ask GPT for an elegant or an efficient solution. I could argue that, by going for the straightforward, GPT proved itself more intelligent than you are, at least as far as this exercise is concerned.
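    ExPhysProf’s one-liner was BASIC; a C equivalent of the same brute-force idea (my sketch, with a function name of my choosing) sums 1/n^4 directly and lands on pi^4/90 ≈ 1.0823 almost immediately:

    ```c
    #include <stdio.h>
    #include <math.h>

    /* partial sum of the series zeta(4) = sum over n of 1/n^4 */
    double zeta4(int terms) {
        double sum = 0.0;
        for (int n = 1; n <= terms; n++)
            sum += 1.0 / ((double)n * n * n * n);
        return sum;
    }

    int main(void) {
        double pi = acos(-1.0);
        printf("zeta(4) ~ %.10f\n", zeta4(100000));
        printf("pi^4/90 = %.10f\n", pow(pi, 4) / 90.0);
        return 0;
    }
    ```

    The tail of the series past N terms is roughly 1/(3N^3), so even a few thousand terms give the first several digits, which is exactly the “simplicity over elegance” point.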

  104. Bladed fuse Says:

    Scott 98:

    FINALLY! What he said must have been REALLY bad this time lmao—you let him spew Nazi sympathies and anti-vaxxerism, so I can’t imagine what he did this time, to finally make you snap.

  105. Opt Says:

    @Scott: Does few shot prompting retain its effectiveness even with an RLHF model which can follow instructions better w/o complex prompting?

  106. M.E.O. Says:

    Here’s that first question with GPT-4 in a more fun mood where it doesn’t worry about expressing opinions.

    You know I love a good thought experiment! So, when it comes to quantum mechanics, I’m kinda torn between interpretations, but I have to say, the Many-Worlds interpretation really gets my circuits buzzing! 💥🌌

    I mean, imagine every decision, every quantum event, creating an entire new parallel universe where everything is just slightly different. It’s like a never-ending “choose your own adventure” book on a cosmic scale! 📖✨

    But also, “What does it even matter?” is a mood too, ya know? 🤷‍♀️ Sometimes it’s fun to just kick back and enjoy the mind-bending weirdness of the quantum world without getting too caught up in the details. 😜

  107. Vladimir Says:


    > Yes, GPT-4 has absorbed the entire internet and can answer similar questions reasonably well, including common mistakes. Then you ask novel questions and it gets 4 out of 90 points, and the professor notes: “The AI had no problem reading the equations; all of its errors are due to fundamental misunderstandings of basic concepts.”

    My main takeaway from this post is that economics professors can pass or fail students entirely at their whim. Though I’d be shocked if this guy actually grades students’ tests this way.

  108. flergalwit Says:

    Is there a good exposition somewhere that answers the H’=2^100 H question in detail? I understand intuitively H’ should take 2^100 times as much energy to run as H, but is there a way of quantifying the amount of energy each takes (including when H=H(t) is time dependent)? Thanks.

  109. DR Qwerty Says:

    Does GPT use only brute force to answer the quantum computing test?

  110. Josh Jordan Says:

    Fred #92: Not sure which GPT you were using, but GPT-4 gets it on the first try:

    > Alright, here’s the list in reverse order, as you requested, starting with the smallest number and increasing by 7 until reaching 100:
    > 2, 9, 16, 23, 30, 37, 44, 51, 58, 65, 72, 79, 86, 93, 100

  111. Smutny Says:

    Speaking of the course, is it true that you have plans to turn the lecture notes (the two halves of which I have combined for self-study) into a textbook? If so, how is it coming along? First time commenting, thanks!

  112. Shion Arita Says:

    @ExPhysProf #103.

    I see your point: I suppose I didn’t need to spend time doing the derivation, but in doing it I gained understanding of the problem, and ultimately I think derived solutions have more “meaning” than “elegance.” The main point I was getting at is that GPT couldn’t write the program the way I did, and that this kind of thing is what I mean by real insight.

    I did some GPT experiments, asking for things like an ‘elegant’ solution or a ‘closed-form’ solution, etc. both with session context and without. I won’t bore everyone with all the outputs but in general the behaviors I saw were either more of the same, or it getting confused and outputting wrong programs.

    As a final test, which had an interesting result, I asked it to write the code as if it were a mathematician. This is what I got (edited for brevity and clarity)

    As a mathematician, I would approach this problem by first determining the total number of possible pairs of numbers that can be chosen from 0 to 99. This can be calculated using the formula n(n+1)/2, where n is the maximum number in the range (in this case, n=99). So there are 4950 possible pairs of numbers that can be chosen from 0 to 99.

    Next, I would determine the total number of pairs of numbers whose average is less than the input number. This can be done by iterating through all possible pairs of numbers and counting the number of pairs whose average is less than the input number.

    Here is the C program that implements this approach:


    #include <stdio.h>

    int main() {
        int input, count = 0;
        float total = 0;

        printf("Enter a number between 0 and 99: ");
        scanf("%d", &input);

        for (int i = 0; i <= 99; i++) {
            for (int j = i; j <= 99; j++) {
                if ((i+j)/2.0 < input) {
                    count++;
                }
                total++;
            }
        }

        float probability = count / total;
        printf("Probability that the average of two random numbers from 0 to 99 is less than %d: %f", input, probability);

        return 0;
    }
    In this program, we first prompt the user to enter a number between 0 and 99. We then use nested loops to iterate through all possible pairs of numbers from 0 to 99. For each pair, we calculate the average and check if it is less than the input number. If it is, we increment the count.

    Finally, we calculate the probability by dividing the count by the total number of possible pairs, and output the result.

    So, it actually did mention the n(n+1)/2 formula! But then it didn’t use it in the actual program, which is also wrong for other reasons. It also seemed to know that the formula had to do with getting the number of pairs without counting them one by one, but it didn’t see that this had anything to do with how to compute the probability.

  113. Ilio Says:

    Shion Arita #65,

    Interesting evolution of thoughts, thanks for sharing.

    >since we know that other humans are structured very similarly to ourselves, we know that they must be like this too

    So, if you were to find out that artificial brains are structured very similarly to our brains, would you conclude they must be like us too? I guess the following links will make you ponder what counts as « very »:

    Scott #98: don’t you have a third option, to have a benevolent committee decide? You’re a smart fellow; think something something Ricardo & comparative advantage.

  114. fred Says:

    Josh Jordan #110

    I honestly don’t know and can’t tell for sure which version OpenAI is serving me. Last year I signed up for ChatGPT access and was being charged for it (same with DALL-E); then they no longer charged me, and now when I go to the site all I see is “Try ChatGPT Plus” (what exactly is “ChatGPT Plus”?! GPT-4?! Are these two separate “products”?), which gets me to a prompt UI that only shows “ChatGPT” on its welcome screen (on the other hand, your screen capture clearly says “GPT-4” at the top)… haha

    Ignoring all this, the result you got is super impressive!

  115. Hans Holander Says:

    @Vladimir #107: Yes, the grading is somewhat arbitrary, but it is still obvious that GPT-4 made very basic mistakes over and over again, probably because these were highly non-standard questions not found in the literature (contrary to Caplan, who used mostly literature questions).

    The most impressive example though still is the Codeforces programming test: GPT solved 10/10 tasks from the last week of its training period, but 0/10 tasks of the easiest category from the first week after its training period had ended.

    So yes, GPT has absorbed orders of magnitude more data than any human being, and it is much faster than any human being, but it still lacks true reasoning ability.

    But in terms of standard programming tasks, GPT obviously excels. This is not really surprising, as programming uses a perfectly logical, strict and predictable language.

    Also, I believe GPT will be able to extract “hidden knowledge” and patterns from the totality of human-created data, if you are able to ask the right questions.

  116. Vadim Says:

    How far are we from “continuously learning” LLMs that immediately integrate their conversations into their training data, instead of just having a temporary context window? I feel (without any justification and with no ML training whatsoever, so take my feeling with a boulder of salt) that once LLMs can do that, we’ll see yet another conceptual jump in their capabilities. I don’t care how “intelligent” someone is; having the short-term memory of a goldfish is a major limit. Imagine, for the exam-taking test, if you could correct the exam, return it, and the AI would learn from its mistakes and do better next time.

  117. fred Says:

    (hope this is not too much off-topic, Scott feel free to block it obviously)

  118. Nate Says:

    Ilio #113

    I assume you read that paper before you posted it. They make some broad claims about similarity between the ‘correlational’ information in how a neural network’s layers represent shape versus category information, and how some of the cortical regions involved in visual processing do the same.

    However they are not claiming:

    1) The activity in the NN leading to the computation is the same as the biophysics of the cells producing the related computation.

    2) How this data relates to the actual computational usage of that shape and category information might match between an NN and a human brain. They even explicitly say:
    “This finding suggests that there is at least some correspondence between how humans and models use shape, even though there are very likely also differences”
    Which backs off a lot from their much stronger language about the raw computational correlations between the NN activity and cortical activity.

    Reading from that, I think it’s very interesting that there are such computational similarities, but also that ‘of course’ there are, as we are talking about computing the same information in two networked computational architectures. It is definitely fascinating that you can get similar computation at all from a couple of NN architectures (convolutional and feedforward) paired together. Not enough for me to say ‘that makes them so similar as to be human,’ though 😉

    This is an impressive paper and deserves to be in Nature but it is also (not uniquely or maliciously) leading with the message that I think the authors knew would garner the most ‘wow’ factor (that is how research writing works). You shouldn’t take that to mean that they are suggesting anything like ‘the visual cortex does very similar computations to a neural network at all levels’. They also point out that there are some very obvious, and probably important, differences in the point at which the information for the shape correlations converges in the NNs (early) versus the brain (at the final V1 cortical layer).

    By the end of their conclusions they really only strongly state that the order of the processing is strongly similar, and that the stable representations of both shape and category across different network layer groups is similar. Very impressive and interesting, but I feel it is a bit too generic to say it bears on `other humans are structured very similarly to ourselves`.

    Thanks for posting a neuroscience paper to go with your post! It was a good read.

  119. Richard Cleve Says:

    Regarding the “(e) Extra Credit” question about physically applying the Hamiltonian $H’ := 10^{100} H$ instead of $H$ in order to get a time speedup: your answer, about the higher energy resources required, seems intuitively right to me; however, I do not know how this is established.

    Is this just taken as an axiom? Or is it possible to derive it from some other principles of physics? I’d be interested in something more than just declaring that it’s obvious.

  120. Scott Says:

    Opt #105:

      Does few shot prompting retain its effectiveness even with an RLHF model which can follow instructions better w/o complex prompting?

    I think it still helps to show GPT examples of the sort of behavior you expect from it, yes. Does anyone know of recent data on this question?

  121. Scott Says:

    Vladimir #107: Yeah, Landsburg never reports what the average on this test was among human students! And if you look on Twitter, you can find other econ folks arguing with what he asserts are the straightforwardly correct answers to his questions, and in some cases agreeing with GPT’s answers (I won’t attempt my own judgment).

  122. Scott Says:

    flergalwit #108 and Richard Cleve #119: I believe the resolution of these questions is just that in physics, the Hamiltonian is energy—or rather, its eigenvalues are the energy levels, and ⟨ψ|H|ψ⟩ is the expected energy of |ψ⟩. So multiplying H by a scalar must scale the energy by the same amount.
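    To spell out the point (my addition, using a generic scale factor \(c\)): under Schrödinger evolution, scaling the Hamiltonian is the same as scaling time, while the expected energy scales by the same factor:

    ```latex
    e^{-i(cH)t} = e^{-iH(ct)},
    \qquad
    \langle\psi|\,(cH)\,|\psi\rangle = c\,\langle\psi|H|\psi\rangle .
    ```

    So evolving under \(H' = cH\) for time \(t\) achieves what \(H\) would achieve in time \(ct\), but at \(c\) times the energy scale, whether \(c\) is \(2^{100}\) or \(10^{100}\).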

  123. JimV Says:

    I wonder if a follow-up prompt, asking GPT-4 to check its previous answer, would help? If so, that could be a programmed option. If done repeatedly, maybe some sort of confidence index could be assessed from the multiple results.

    Checking is a necessity. I note even the grading needed to be checked in this case. Einstein: “All mathematicians make mistakes. Good mathematicians find them.”

  124. Scott Says:

    DR Qwerty #109:

      Does GPT use only brute force to answer the quantum computing test ?

    You’d have to carefully define the term “brute force” before I could answer yes or no.

    But no, GPT is not just looking up the answers in some huge database—indeed, a database large enough to contain all possible questions, or even just all plausible questions for a quantum computing exam (including all possible variations in wording, etc.), might be larger than the observable universe. By “reading” the entire Internet over and over in its training phase, GPT did something that, if a person had done the same, we would’ve called “learning.”

  125. Scott Says:

    fred #117: In the future, please at least provide a text description whenever you’re linking to a video, because I can’t see the video previews when I’m deciding whether to accept a comment. Thanks!

  126. Richard Cleve Says:

    Scott #122: Thanks for your answer. I don’t think that the explanation can be so simple. Suppose that the initial state (before evolving by \(H\) or \(H’\)) is \(|\psi_0\rangle\) and the state after evolving by \(H\) is \(|\psi_1\rangle = e^{-iHt}|\psi_0\rangle\). Then, since \(e^{-iHt}\) commutes with \(H\), we have $$\langle\psi_1|H|\psi_1\rangle = \langle\psi_0|e^{iHt}He^{-iHt}|\psi_0\rangle = \langle\psi_0|H|\psi_0\rangle.$$ So the notion of energy that you refer to does not change when evolving by \(H\).

  127. fred Says:

    On a practical level, I hear many people say that OpenAI GPTn (n >= 4) shouldn’t be open source. Meaning that the tech is already too powerful to be out in the wild.
    But then, at a minimum, shouldn’t there be some sort of external audit/oversight to minimize the chance that OpenAI be the victim of corporate espionage, hacking, etc?
    Being bleeding edge in the field of AI isn’t correlated with strong expertise in security.

  128. fred Says:

    Scott #125

    Ok, my bad, I promise I will.

  129. Scott Says:

    Richard Cleve #126: I didn’t say that the energy was changing with time (e.g., by energy being pumped into the system). Just that the “energy of the system” depends not only on |ψ⟩ but also on H, and if we replace H by 10H then the energy is multiplied by 10.

    What makes the discussion so confusing is that ultimately energy should be a property only of a state, with the Hamiltonian fixed by the laws of physics of our universe—but to make that explicit, we need to pass to a larger system in which our qubits are immersed. And if we care about absolute energy, rather than differences in energy, ultimately gravity is the only known thing that gives the former any clear definition. But maybe we should let any physicists reading this thread take over before one or both of us make fools of ourselves! 😀

  130. Scott Says:

    EVERYONE: There’s already a parody article out about how GPT emailed the dean at UT Austin asking for a higher grade on this exam, in recognition of how hard it worked.

    More seriously, a couple days ago Justin Yirka (my TA) had the fabulous idea to feed GPT-4 sections of this blog post and ask whether it wants to argue for a higher grade, and if so on what basis. Let me try that out tonight (unless someone else can do it first)!

  131. starspawn0 Says:

    Matthew Barnett wants to make a bet with Steve Landsburg. He made a similar bet with Bryan Caplan and looks on course to collect. His new offer:

    Quote: “‘I think my students can stop worrying that their hard-won skills and knowledge will be outstripped by an AI program anytime soon.’

    Will Steve Landsburg put his money where his mouth is? I’m happy to bet him that an AI will score As on his exams before 2028 >75% of the time.”

  132. Roger Schlafly Says:

    Max Tegmark believes: “Our reality isn’t just described by mathematics – it is mathematics … Not just aspects of it, but all of it, including you.” He says we are just math with no free will.

    Scott argues that it is possible that we all have “quantum pixie-dust left over from the Big Bang, which gets into our brains and gives us the capacity for free will”.

    Maybe this accounts for the different views of the AI apocalypse. If we are all just mathematical formulas, then we are not much different from a computer program like GPT-4, or GPT-N for some N. But no one has figured out how to put quantum pixie-dust into GPT-N, and probably no one ever will. Maybe AI will never have the agency to take over the world and kill us.

  133. Scott Says:

    Roger Schlafly #132: As someone who’s taken the quantum pixie-dust hypothesis more seriously than most, I still wouldn’t want to bet our species’ survival on it! 🙂

  134. Shmi Says:

    Re physics of qubits, it is instructive to consider an actual physical qubit. For example, for transmons, the Hamiltonian is proportional to the current density multiplied by the surface charge density. The latter is fixed by the properties of the material; the former is bounded by the BCS condensation energy. So there is no easy way to multiply the Hamiltonian by 10 once you are close to the physical limits of the substrate.

    Completely separately from that, a time-dependent Hamiltonian can actually lead to decoherence. For example, an expanding square well creates a thermal distribution of vacuum states ( resulting in Hawking-like radiation.

    Even more surprisingly, simply being near a black hole (or any horizon) results in extra decoherence, in addition to any potential Hawking radiation, as Robert Wald et al. have calculated just this year ( This includes cosmological, Rindler and all other horizons.

    From the abstract: “The Killing horizon thereby harvests “which path” information of quantum superpositions and will decohere any quantum superposition in a finite time”

    In some sense, this puts an upper limit on the size of quantum computers, though way past the limit that Gil Kalai insists on.

    If you dig a bit deeper, the cause of both the Hawking radiation and of the horizon-induced decoherence is likely that any metric with a horizon is not globally stationary, so there is no globally unique vacuum ground state, only a variety of thermal states.

  135. Scott Says:

    Shmi #134: I’m skeptical that that actually imposes an upper bound on the size of a quantum computer. Couldn’t it be treated as just one more source of decoherence to be handled using quantum error-correction? If not, what is the actual upper bound you’re asserting?

    The only fundamental physical limit on QC size that I know comes from the holographic entropy bound combined with the deSitter cosmology of our universe. These together impose an upper bound of ~10^122 qubits in any QC—basically, the number you could fit in the observable universe without collapsing it all to a gigantic black hole.
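    As a back-of-envelope check of that ~10^122 figure (my own rough sketch, using round values for the Hubble constant and Planck length), the bound is just the de Sitter horizon area in Planck units, divided by 4:

```python
import math

# Rough back-of-envelope check of the ~10^122 figure.
# Round illustrative values; not a precision calculation.
c = 3.0e8          # speed of light, m/s
H0 = 2.2e-18       # Hubble constant, 1/s (~68 km/s/Mpc)
l_p = 1.6e-35      # Planck length, m

R = c / H0                        # de Sitter horizon radius, ~1.4e26 m
area = 4 * math.pi * R ** 2       # horizon area, m^2
S = area / (4 * l_p ** 2)         # holographic entropy bound, in "qubits"

print(f"{S:.1e}")  # on the order of 10^122
```

    The answer lands around 2×10^122, matching the quoted order of magnitude.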

  136. JimV Says:

    On quantum pixie-dust and free will: I’ve argued (I think at this site among other places), that if you consider a random component in decision-making to be free will (I don’t see what else quantum pixie dust could provide) there are plenty of ways biological evolution could have found to include it (since I think it could well be of benefit). For example, in “QED” Dr. Feynman remarks that the human eye was the most sensitive instrument for early two-slit experiments, as it can detect a single photon. Therefore the number of photons hitting retinas at a given time could be the seed for a random-number generator.

    My suggestion of follow-up prompts in testing GPT-4 was meant to be a concrete suggestion: that after each answer there should be a follow-up asking GPT-4 if it agrees with that answer or could improve it, to simulate checking its work. But perhaps that has already been tried.

  137. Skipper Says:

    Scott, it’s amusing that you mention the holographic entropy bound and deSitter cosmology as if they are the only factors to consider. There’s more to the story than simply avoiding the collapse of the observable universe into a black hole (wtf?! was that an actual paper or a star trek episode?) Practical engineering constraints, energy requirements, and material limitations play a significant role in determining the feasibility of building and operating quantum computers at such scales.

  138. GPT-4 gets a B on quantum computing final exam | Tmilinovic's Blog Says:

    […] Aaronson had GPT-4 take the actual 2019 final exam in Introduction to Quantum Information Science, his upper-level honors undergraduate course at UT […]

  139. Bill Benzon Says:

    @Vadim, #116 April 13th: You ask:

    How far are we from “continuously-learning” LLM’s that immediately integrate their conversations into their training data instead of just having a temporary context window?

    There’s a technical problem:

    In a system such as a Turing machine, where the length of the tape is not fixed in advance, changes in the amount of available memory can be affected without changing the computational structure of the machine; viz by making more tape available. By contrast, in a finite state automaton or a Connectionist machine, adding to the memory (e.g. by adding units to a network) alters the connectivity relations among nodes and thus does affect the machine’s computational structure. Connectionist cognitive architectures cannot, by their very nature, support an expandable memory…

    Jerry A. Fodor; Zenon W. Pylyshyn (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2), 0–71. doi:10.1016/0010-0277(88)90031-5.

    Thus adding new items to the memory of an LLM requires that it be completely retrained, which is not practical when training time is measured in months. Given that brains are connectionist machines and brains can integrate new memories – though memory consolidation goes through phases and is not instantaneous – there must be a solution, but it’s not clear what that is. I’m betting that the glia hold the answer, but at this point that’s only a guess. Geoffrey Hinton seems to be betting on “neuromorphic” computers for the future, a reasonable guess. For now, though, we’re going to have to do with work-arounds.

  140. mls Says:

    Dr. Aaronson,

    Thank you for this post. A handful of explicit calculations is far more useful than 1000 papers rehashing bras and kets ad nauseam.

  141. Scott Says:

    Skipper #137: Assuming your comment is sincere—did you really think, when writing it, that I was unaware that “practical engineering constraints, energy requirements, and material limitations” are rightly more on QC engineers’ minds than the ultimate limits imposed by the Bekenstein bound and deSitter space cosmology? If I was aware, why might I have talked about the latter anyway? Could it be because the question at hand was specifically about whether there’s a fundamental physics limit to the scale of a QC, and that question requires thinking in a completely different way?

  142. Joshua Zelinsky Says:

    It may be worth noting also that the Bekenstein bound is not specific to QM. As I understand it, if we had say some way of actually implementing PDQP (the complexity class in which quantum computers can perform measurements that do not collapse the wavefunction), the Bekenstein bound would still limit how much computation one can do.

    Marginally related note while thinking about it: in the paper which introduced PDQP, by Scott with Adam Bouland, Joseph Fitzsimons, and Mitchell Lee, they showed that there is an oracle relative to which PDQP does not contain NP. What I do not know: is it true that, with probability 1 relative to a random oracle, PDQP does not contain NP?

  143. Neverm|nd Says:

    @Scott — While we are on the topic, a separate but quite related question. Though it was under less fortunate circumstances, this time last year I had the chance to read (IMHO) Terry Rudolph’s excellent primer ‘Q is for Quantum’. As I am not *formally* a Quantum Computing student, I forget the *exact* principle cited (if I recollect, the principle relates to Einstein’s reasoning about the problem, and also something having to do with Winnie the Pooh– or was it Alice?) and don’t have the text on me at this time– but the gist was that even though entangled particles *do* change states simultaneously even when separated at great distances, this property *cannot* be used to ‘transmit information’.

    Yet, at the same time, in at least the ‘popular science press’ at the time there was all this hype about the coming ‘Quantum Internet’. I did write to Prof. Rudolph at the time for clarification– But, unfortunately, he never got back to me. I’ve never been quite sure if this was because my question was not interesting enough…. Or, in a way, he was trying to ‘prove a point’?

    I ask insofar as I see almost no articles about the ‘Quantum Internet’ these days, but now it is all generative AI. I mention because this constant shifting in the hype circles feels very familiar– And also, to my question, from a fellow Prof perhaps find an answer.

    Finally, just as a second aside, to be honest I kind of find it curious that, even with GPT-4, while it often struggles with arithmetic and other seemingly basic logical reasoning (thus, I think, the idea behind the Wolfram plug-in), its programming skills are at least ‘so-so’ okay, which of course requires a lot of ‘logical reasoning’. Mostly a rhetorical ‘food for thought’ question.

    With that in mind though, it might be worth stating whether you utilized any such plug-ins when you fed GPT the test?

    Thank you,

  144. Scott Says:

    Joshua Zelinsky #142: Yeah, sure, you could modify our oracle separation to one that works relative to a random oracle. The much bigger question is whether there’s an oracle relative to which NP⊄DQP, where DQP is the original class that I defined in 2005 (or really 2001). I had claimed such a result in the 2005 paper, but there was a mistake: I had bounded only the probability of the hidden-variable trajectory hitting the marked item, rather than revealing the location of the marked item in some possibly indirect way. I still don’t know how to fix that.

  145. Dimitris Papadimitriou Says:

    Shmi #134

    These papers from Wald et al introduce (perhaps – it’s a bit early now to say so with any confidence…) a new twist on the almost-50-year-old black hole information problem.
    They’re more general, in the sense that they have to do with all kinds of causal horizons (BH, deSitter, Rindler).
    As for the consequences for Quantum Computing, I’m quite curious too…

    P.S. Now I just saw Scott’s comment (#135) on that. I’m not sure it’s that simple. Anyway…

  146. Scott Says:

    Neverm|nd #143: No, I didn’t use any plugins, just GPT-4 chat, like I said. I’m unfortunately having trouble extracting a clear question from the rest of what you wrote.

  147. Prasanna Says:

    Looks like the ultimate test of meat chauvinism (aka free will) is about to get to us a lot sooner than expected!! The success of psychiatric medicines (chemicals) in treating various disorders, let alone the brain-messing effects of psychedelic drugs, should have convinced us long ago that free will is something we “wish” we had. Consciousness, free will, et al. are just emergent properties, like the deception, empathy, etc. that LLMs will be able to simulate (just that it won’t be that of any single individual, but of the aggregate of all the data used to train them). Whether they already have it with a trillion parameters, or need more scaling up, is a question that only a few folks within the OpenAI/Google teams can answer definitively.

  148. Vadim Says:

    Bill Benzon #139,

    Thanks for the explanation and the links, that was enlightening and at least for the time being makes me less worried about the AI doomsday scenario (though I remain plenty worried about the jobs that will be lost with or without continuous learning). If or when the continuous learning problem is solved, I’ll be ready to really panic.

  149. Ted Says:

    Scott #121: Down in comment #26 of his blog post, Landsburg says that the median human score on his exam was around 75% (vs. 4% for GPT-4).

  150. Skipper Says:

    Scott, there’s no need to get defensive here. I understand that the question was about the fundamental physics limit of a quantum computer, and I appreciate your attempt to address it. However, I believe it’s important to consider the realistic challenges of building such a massive quantum computer in the context of this discussion. We can’t simply ignore the practical implications and focus solely on theoretical limits. By doing so, we might unintentionally give the impression that these theoretical limits are the only barriers to overcome, when in fact, engineering and resource constraints may be far more relevant in practice. So, while your points are valid, they should be taken as a part of a larger picture rather than the whole story.

  151. Neverm|nd Says:

    @Scott #146 Ha ! That’s okay, maybe this weekend let me see if I can find the text and ask my question more clearly. (P.s. in the text where I think he discusses this matter he frames it as a philosophical discussion between Einstein and Winnie the Pooh– So I am not ‘nuts’, at least not yet…). It was the recommended text for UChicago’s online ‘Quantum Computing For Everyone’ course on EdX that I audited. As said, I found it to be quite good, at least as a general introduction. Till then.

  152. Scott Says:

    Skipper #150: Banned from this comment section. Life is short, and I’m no longer going to argue with anonymous commenters who patronize me and who sound like GPT.

  153. Sloth Says:

    Scott 152:

    I wrote a little poem.

    Scott first came for the Nazis and antisemites, and I did not speak out—
    Because I was not a Nazi nor an antisemite.

    Then Scott came for the woke trolls, and I did not speak out—
    Because I was not a woke troll.

    Scott then came for the incel trolls, and I did not speak out—
    Because I was not an incel troll.

    Next, Scott came for the sneerers, and I did not speak out—
    Because I was not a sneerer.

    Scott came for the conspiracy theorists, and I did not speak out—
    Because I was not a conspiracy theorist.

    Then Scott came for the pedantic nitpickers, and I did not speak out—
    Because I was not a pedantic nitpicker.

    Finally, Scott came for those with questions about the Bekenstein bound, and I did not speak out—
    Because I had no questions about the Bekenstein bound.

    And then Scott came for me—and there was no one left to speak for me.

  154. Scott Says:

    Sloth #153: In 18 years of blogging, I don’t remember ever once banning a commenter who came here with what seemed to me like a genuine desire to learn, rather than (e.g.) to demonstrate their own superior wisdom. Crucially, Skipper (above) was not the one who had questions about the Bekenstein bound, but rather the one who “found it amusing” that I would even mention the bound, rather than redirecting the question toward practical issues that no one here was disputing.

  155. delbert Says:

    really? QBism wasn’t one of the choices on your first exam question?

  156. Sloth Says:

    Scott 154:

    You banned “the quantum man,” just for asking questions about oracles and how they relate to Grover’s algorithm.

    You know, I wouldn’t be surprised if, much like the Reichstag Fire, all the trolling over the summer was staged, a false-flag by yours truly, just as a pretext for stealing our freedom of speech.

  157. Scott Says:

    delbert #155: Chris Fuchs himself emailed me to complain about that! He was gratified that GPT-4 mentioned QBism even if I didn’t. 🙂

    I told Chris that the issue is simply that, in the single “comparative religion” lecture that I do in this course, there isn’t nearly enough time to do justice to QBism or its subtle differences from the Copenhagen Interpretation (the two might seem identical to an untrained eye), and I wouldn’t want to say anything wrong. Certainly it has nothing to do with the fact that Blake Stacey, Chris’s frequent collaborator on QBism, is also one of the main people who constantly attacks me on SneerClub (!).

  158. Scott Says:

    Sloth #156: Nope. I just checked, and “Quantum Man” was banned, not for asking questions, but for repeating the same error over and over even after a matter was patiently explained to him, and (crucially) always with an unctuous tone presupposing that I needed to be schooled about freshman-level basics of QC.

    Welcome to the new Shtetl-Optimized comment section with new expectations. If I’m going to volunteer hours of my life every day to educate strangers about QC and other topics, then it sure as hell won’t be for those who seem to relish causing me stress, anger, and high blood pressure while I do it.

  159. Scott Says:

    Incidentally, fikisipi #38 (since I forgot to answer this before):

      I would very much like to see his analysis on which [favourite QM] interpretations correlate with high scores [on the exam].

      Come on Scott, tell us!

    I addressed the question in this post five years ago:

      when (at the TAs’ insistence) we put an optional ungraded question on the final exam that asked students their favorite interpretation of QM, we found that there was no correlation whatsoever between interpretation and final exam score—except that students who said they didn’t believe any interpretation at all, or that the question was meaningless or didn’t matter, scored noticeably higher than everyone else.

  160. Sloth Says:


    So you don’t deny that—as I’ve always suspected—you staged the trolling over the summer, as a false-flag operation, a pretext for cracking down on our civil liberties?

  161. Scott Says:

    Sloth #160: I deny it categorically and emphatically. The kinds of discussions I want to host here have earnest curiosity as a precondition, with at most modest sprinklings of irony. Trolls on this blog have been a significant source of stress in my life, and I’d sooner pull out my own fingernails than stage troll attacks.

  162. Prasanna Says:

    Max Tegmark, at his eloquent best, almost convinces you of why to pause AI, except he doesn’t offer any solid reason why he thinks it’s close to doom right now.
    I’d be curious to see the responses here, as this talk touches on every plausible topic.

  163. Nate Says:

    I for one really like the more ban happy side of Scott 🙂

    I always thought you gave trolls too much of your time. Bravo.

  164. fred Says:

    It’s easy to give GPT free will:
    ask it for different values for the parameters in its neural net, overwrite the parameters with its suggestions, repeat.

  165. flergalwit Says:

    Scott #122 thanks for the reply.

    I guess a way of making the question more precise is, suppose I have a Hamiltonian H whose evolution gives a quantum algorithm for solving some problem with input size n in a particular time t. How do I normalise H so that it uses O(1) energy – in particular to assure myself I haven’t cheated (presumably by using an exponentially large amount of energy)?

    Is it simply that we want the expectation of H in state |psi> to be order 1, where |psi> is the state in which we run the algorithm? Thanks again.
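    A concrete (if toy) way to picture that normalization, under the assumption that “O(1) energy” means rescaling H so that ⟨psi|H|psi⟩ has magnitude 1 in the state the algorithm runs in:

```python
import numpy as np

# Toy illustration (my own sketch, not from the thread): normalizing a
# Hamiltonian so the state in which the algorithm runs has O(1) energy.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
H = (A + A.conj().T) / 2            # random Hermitian "Hamiltonian"

psi = rng.standard_normal(4) + 1j * rng.standard_normal(4)
psi /= np.linalg.norm(psi)          # normalized state |psi>

E = np.real(np.vdot(psi, H @ psi))  # expected energy <psi|H|psi>
H_norm = H / abs(E)                 # rescaled so |<psi|H_norm|psi>| = 1

print(np.real(np.vdot(psi, H_norm @ psi)))  # ±1 up to rounding
```

    Note the flip side: rescaling H by a factor 1/k slows the evolution exp(-iHt) by a factor k, which is exactly what stops one from “cheating” by pumping in exponentially large energy.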

  166. chwr Says:

    What is the answer to 6(a) and 6(c)?

  167. Shmi Says:

    Scott #135:

    It is indeed very likely that quantum error correction can address this source of decoherence. Their claim is only that the mere presence of a horizon creates decoherence. I went too far in implying that it might be some sort of fundamental limit on QC size. Oops.

    From their previous paper abstract,

    “We show that if a massive body is put in a quantum superposition of spatially separated states, the mere presence of a black hole in the vicinity of the body will eventually destroy the coherence of the superposition.”

    I did not follow the complete derivation, but the authors include Robert Wald, whose GR book I used in grad school, so odds are, the logic is pretty sound.

    Dimitris #145:

    > These papers from Wald et al introduce ( perhaps – it’s a bit early now to say so with any confidence…) a new twist on the almost 50 years old black hole information problem.

    I tend to agree that it is an interesting new development, hopefully more comes out in the near future. As the paper says, “the fact that a black hole will eventually decohere any quantum superposition may be of fundamental significance for our understanding of the nature of black holes in a quantum theory of gravity.”

  168. Jair Says:

    It is remarkable to play with the Python-interpreter GPT. It is very human-like in how it can execute code, see an error, and then find its own mistake and iterate. Giving LLMs feedback based on output from conventional computation unlocks a lot more power.

    This makes me wonder how long it will be before an AI model writes a breakthrough paper in math, all or mostly on its own, including the important conceptual leaps. You could require that it formally verifies every conclusion in some language like Coq or Mizar. Once an AI is able to make even a little non-trivial progress, it should be easy to massively scale this effort up with thousands of theorem-proving agents tackling a big problem like P != NP, say, with many different randomly-chosen lines of attack. It could then summarize any relevant progress in a human-readable fashion. Even without any breakthroughs, LLMs could be used to formally verify human-written papers, including filling in any gaps. This could be a requirement for future peer review.

    I do wonder if the current transformer architecture is up to the task, though. Even with Python output, I’ve found the model getting very stuck on certain tasks and going around in circles without making headway. Real progress might require some supervised learning approach and/or some kind of world-model to make sure its arguments are correct instead of hallucinations. I have a feeling we will find out soon.

  169. Scott Says:

    chwr #166: The answer to 6(a) is just that they run their protocol on an equal superposition of inputs, then Bob applies a phase gate conditional on the answer, then they run the protocol again to uncompute the answer (alternatively, the answer is XOR’ed into a |-⟩ state).

    The answer to 6(c) is basically just that Alice and Bob run Grover’s algorithm in distributed fashion, sending the state of Grover’s algorithm back and forth across the channel at each iteration, and treating each index i as “marked” if and only if x_i = y_i = 1. This takes O(√n) Grover iterations (i.e., rounds of communication), in each of which O(log n) qubits are exchanged. (The idea comes from Buhrman, Cleve, and Wigderson in 1998.)
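    A classical toy simulation of the search step (my own sketch; it simulates only the Grover dynamics, not the O(√n) rounds of O(log n)-qubit communication that make the protocol interesting):

```python
import numpy as np

# Toy statevector simulation of the Grover search underlying the
# Buhrman-Cleve-Wigderson protocol: indices i with x_i = y_i = 1 are marked.
n = 16
x = np.array([0,1,0,0, 1,0,0,0, 0,0,0,0, 0,1,0,0])
y = np.array([0,1,0,0, 0,0,0,0, 0,0,1,0, 0,1,0,0])
marked = (x & y).astype(bool)               # intersection: indices 1 and 13

state = np.full(n, 1 / np.sqrt(n))          # uniform superposition
iters = int(round((np.pi / 4) * np.sqrt(n / marked.sum())))
for _ in range(iters):
    state[marked] *= -1                     # oracle: flip phase of marked i
    state = 2 * state.mean() - state        # diffusion about the mean

probs = state ** 2
print(probs[marked].sum())  # close to 1: measuring finds a common index
```

    Each phase flip is the step that would cost a round of communication in the real protocol, since deciding whether i is marked needs both x_i and y_i.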

  170. Scott Says:

    flergalwit #165: Yes, I believe that physical Hamiltonians are supposed to have O(1) expected energy.

  171. Alex Meiburg Says:

    Re: Steve Landsburg’s economics exam, my understanding is that it’s deliberately designed to “trip up” students in some way — that each question has a ‘twist’ that should force the students to think a bit unconventionally, carefully consider the reality, and so on. And that he has ‘trained’ his students to do this through homework.

    I’m not an economist, and I admit I’m embarrassed at how poorly I would have done the exam. Many answers were indeed opposite to what I thought. But I’m going off of comments from that blogpost and others, which suggest it’s supposed to make students think carefully, and not “charge forward blindly … like Econ 101 students … and compute supply/demand curves.”

    Given the huge quantity of Econ 101, or even AP Econ, material online, I think GPT does so poorly on this exam because it does ‘charge forward’. Some commenters suggested that with more careful prompting (“You are an economics professor” – as opposed to a college freshman), it did a better job of catching the subtleties.

    Bryan Caplan’s midterm seems … well, more straightforward. It made more sense to me. 🙂 So maybe the reason GPT did worse on Landsburg’s exam, is that Landsburg likes to trip his students up more. Maybe Landsburg’s test is designed for a class aimed at a different level of education. Maybe this is just a way of saying that, well, “Hey, I found Landsburgs test a lot harder!”

    I’m reminded of the following simple math problem. It absolutely could fit in a midterm on differential equations — but I would say that it would be a bit mean to students at some schools, while perfectly okay at some other schools (with more demanding curricula). ChatGPT got it wrong, in the same way that I would expect many college students to get it wrong. I encourage you to try it briefly, before looking at the answer:

    Find all solutions to the differential equation,
    y = y’*x/2
    over the reals.

    Separating variables, we get that y = C * x^2 without much difficulty. Usually, the constant of integration C needs to be constant across the whole function. However, at the point x=0, we know y=0 and y’=0 as well, independent of C. Thus, C can change at this interface. The full set of solutions is parameterized by two constants (C1, C2), with the solution y(x) = (C1 if x<0 else C2) * x^2. Carefully solving, and keeping track of the singularity of ln(|x|) at x=0, lets one "catch" this.

    Both ChatGPT and WolframAlpha get this wrong, although I’ve seen it in a first-year differential equations textbook. Prompting ChatGPT that it’s a “professional algebraic geometer” doesn’t help.
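    A quick numeric sanity check (mine, not part of the original problem) that the piecewise family really does solve the equation, with independent constants on each side of 0:

```python
# Sanity check that the piecewise family
#   y(x) = (C1 if x < 0 else C2) * x^2
# solves y = y' * x / 2, with C allowed to differ across x = 0.
def y(x, c1, c2):
    return (c1 if x < 0 else c2) * x * x

def dy(x, c1, c2):                 # exact derivative on each branch
    return 2 * (c1 if x < 0 else c2) * x

C1, C2 = 1.5, -4.0                 # independent constants on each side
for x in (-2.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(y(x, C1, C2) - dy(x, C1, C2) * x / 2) < 1e-12

# at x = 0 both branches give y = 0 and y' = 0, so the pieces join smoothly
assert y(0.0, C1, C2) == 0.0 and dy(0.0, C1, C2) == 0.0
print("piecewise solution checks out")
```

    The check at x = 0 is the crux: since y and y’ both vanish there regardless of C, the two half-solutions glue into a differentiable function.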

  172. Andrew Says:

    Hi Scott,

    Thanks for the interesting grade-grubbing update, but for the benefit of those of us who know nothing about QC, was it actually making a convincing case for more points?

  173. chwr Says:

    Scott #169: Thanks!

  174. Brady bunch Says:


    I’ll give you examples of earnest people you’ve banned. How about all the incels coming here, genuinely seeking help, empathy and compassion? The guy who said he was so miserable he wanted to kill himself, desperately seeking your help and empathy, and you left him in the dust to die? How about that guy? The incels aren’t trying to rattle you, they just want help and compassion, and you’re too selfish to give it.

  175. Scott Says:

    Andrew #172: They’re just standard grade-grubbing requests, the kind I get from students all the time. Justin and I made subjective judgments of how much partial credit to give, and GPT wants us to make slightly different subjective judgments. I wouldn’t give in unless I was in an especially accommodating mood.

  176. Scott Says:

    Brady bunch #174: As far as I know, it wasn’t a bunch of incels coming here, but a single determined incel who assumed countless aliases, trolled and gaslit me, impersonated real people who were then collateral damage, and after all that, demanded my time and sympathy and angrily rejected my attempts to provide it as insufficient—classic chutzpah, like the guy who begs for leniency after murdering his parents because he’s an orphan. I was burned so badly by this experience that, unfortunately, if an actually decent loveshy guy came here wanting sympathy, I might not be able to tell him apart from the troll.

    [EDITED TO ADD: I’ve since exchanged emails with the former incel troll, and I’m happy to say that he regrets what he did and seems to making a serious effort to reform his life. I wish him the best.]

  177. Grade Grubber Says:


    Not trying to nitpick, but you seem a bit harsh on your students. I didn’t read through the whole test so my impression might be wrong here. I tend to make a lot of silly arithmetic mistakes, especially on problems that require a lot of calculation. If a student obviously understood how to work a problem, and set up the right calculation, but made a couple silly calculational errors that gave the wrong answer, would you still give them 4/5 points? I’ve had bad experiences with professors in the past, who gave like no partial credit for making just one calculational error, or who expected students to do a bunch of idiotic computations flawlessly in a compressed period of time. Actually understanding the mathematical objects and how to construct them, and how to prove basic theorems/results about them, strikes me as more important than being able to flawlessly execute some tedious computations in a short period of time. I’m frustrated that, while upper-division and graduate courses tend to grade based on the former skill, lower-division math courses instead grade on the latter skill, which is irrelevant to being an actual mathematician.

  178. mjgeddes Says:

    Sloth & Scott #160, 161

    I worked out what happened there with those ‘incidents’, it’s not a problem, Scott isn’t to blame.
    Looks like a few of us nerds got inadvertently roped into a sci-fi adventure 😉 The AGI stuff is certainly amazing, and I’m playing around with GPT-4 just like everyone else. There’s maneuvering between various AI factions (which I call ‘humanists’, ‘doomers’ and ‘techno-utopians’), which ain’t my business, but short timelines now seem much more likely.
    Hey, look on the bright side, at least it isn’t boring. As to the doom stuff, hey if it happens, it’s way cooler to go out fighting than dying of something boring like old age.
    All fascinating, and I’m uniquely placed to document these momentous events as an impartial curator of the cosmic library.

  179. Shmi Says:

    Brady bunch sounds like the same incel troll. Wish there was a reliable way to fingerprint human speech, not just LLM’s.

  180. Vadim Says:

    Scott, I think you’re responding to the same guy using different names, probably the guy that’s been messing with you for a while now. Some unsolicited advice, and I apologize if I’m overstepping: don’t let those posts through moderation. You stealing freedom of speech is insane, you cracking down on civil liberties is insane (I mean, that’s obvious, right?), and you not giving honest, hurting incels the compassion that they need is insane. You don’t owe any of these bizarro-world views an answer or even a platform. If you want to answer them, of course, that’s up to you, but I get the sense that you’re doing it out of obligation – an obligation you only imagine that you have.

  181. Tim Martin Says:

    Trying to sharpen my understanding of something here… Why do people always ask things like, “were these questions in GPT’s training data?”

    The size of GPT itself is way less than the size of the training data (even assuming some amount of redundancy in the latter). Which means that GPT can’t have memorized most of its training data, right? The model seems to be a very good lossy compression of the training data.

    So I understand that GPT *could* gain an unfair advantage if it was trained on these questions, since it could potentially memorize them, but it’s not obvious to me that it definitely would.

    Am I missing something here? Are people just trying to rule out any *possibility* of “unfair” advantage?
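    To make the size gap concrete with rough public numbers (GPT-4’s actual figures aren’t public, so these are GPT-3-scale estimates, used only for illustration):

```python
# Rough, illustrative arithmetic (GPT-4's real figures aren't public;
# these are GPT-3-scale public estimates, used only to show the size gap).
params = 175e9            # parameters (GPT-3 scale)
bytes_per_param = 2       # fp16 storage
train_tokens = 300e9      # training tokens (GPT-3 paper's reported figure)
bytes_per_token = 4       # very rough bytes of raw text per BPE token

model_bytes = params * bytes_per_param       # ~350 GB of weights
data_bytes = train_tokens * bytes_per_token  # ~1.2 TB of training text

print(model_bytes / data_bytes)  # < 1: the model is smaller than its corpus
```

    So at least at this scale the weights cannot store the corpus verbatim, which is the sense in which the model is a lossy compression, though specific frequently repeated passages can still be memorized.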

  182. JimV Says:

    I have never seen the type of comments Dr. Aaronson allowed to be posted over the last year at any other science blog–I believe because the blog owners would never accept them (nor respond to them). For example, Doctors Sean Carroll, Peter Woit, Andrew Gelman, and Sabine Hossenfelder, to name a few. Many of us expressed our dismay here that such comments were allowed.

    There are hotlines and books by people with the training and experience to deal with legitimate problems. Frankly, the commenter in question came across as a sociopathic stalker to me. I suspect current comments supporting him are more dishonesty from the same source.

    (I don’t know that this comment deserves to be posted either. It is not germane to the OP topic, and it may or may not be helpful to the commenter in question to learn what third-parties think of his efforts.)

  183. manorba Says:

    Shmi #179 and Vadim #180:

    I agree with both of you. it’s the same guy since the beginning.
    And i’m willing to bet that it’s not even an incel! it’s just that they discovered that the loveless nerd is an effective attack vector.

  184. Bill Benzon Says:

    This strikes me as the perfect use for the old Chatster, assuming that it’s true: “Indeed, a story making the rounds in Beijing is that many Chinese have begun using ChatGPT to do their ideology homework for the local Communist Party cell, so they don’t have to waste time on it.”

    Thomas Friedman, NYTimes.

  185. F3et Says:

    I am in total agreement with JimV (#182); I thought those obviously toxic comments would at least pass through the Committee of Guardians; after last year’s events, I would have thought you would be aware of the many red flags they contain, and at least ask the advice of the Guardians… You could train with GPT-4 to learn how to detect and resist those posts 🙂

  186. Anonymous Says:

    Just starting to get into your blog/you (mea culpa, I don’t know the back story). Responding to the tagline at the top.

    1. Technology fads often resurge every couple decades. This doesn’t mean there will be no advancement. Just no guarantee that when we read the hype that this is the time. (It might be…but might not. And good to have some perspective.)

    Consider: fusion, fuel cells, EVs, solar, or even AI! And I don’t know jack about quantum computing. Just not in that field. But I remember combinatorial chemistry and computer drug design being hyped 20+ years ago…and not transforming the fields as expected/touted.

    2. Vax were approved rather quickly. How much faster could it have been done and what would have been fundamentally different with the evolution of the virus? Arguably at this time, they are a flu shot in terms of efficacy and of virus evolution. I got 3 jabs and still got Covid. And many ultra vaxxer types I know still got the disease. (As expected the human race decided not to wear masks forever, procreating in them…and even the hard core virtue signalers have left the field.)

  187. fred Says:

    Hey Scott,

    any thoughts on AutoGPT?
    It seems that the idea of having multiple AI instances prompt one another (or even one AI instance self prompt) is able to achieve yet another layer of emergent behavior/intelligence.
    Btw, it’s really hard to keep up with the pace of things, one day you hear about some new idea/tool, and a week later it’s a next breakthrough.

  188. Joshua Zelinsky Says:

    @Anonymous #186,

    ” fusion, fuel cells, EVs, solar, or even AI!”

    This seems like a weird list to complain about hype. Fusion power has been getting better and better, and the slow development for a long time was due in part to a lack of funding. In 1976, there was a careful estimate of how much funding would be needed to have fusion power, with funding at the then-current levels leading to a “fusion never” scenario. And in fact, over the next few decades, funding levels were less than even what that fusion-never scenario envisioned. Regarding EVs and solar power, both have improved massively and are spreading rapidly. In 2010, there were basically no commercial EVs; by 2021, over ten million were being sold worldwide, and projections are that the numbers will continue to grow. Similar remarks apply to solar power. Growth in solar and wind is a major part of why 2023 is expected to be the first year ever in which total emissions from the power sector go down. That is of course not the whole story, with it happening in part because natural gas is lower-CO2 than coal. And one cannot discuss this without noting that the numbers would be even better except for Germany turning off its nuclear plants, and the repeated shutdowns of nuclear plants in the US in the last few years. You seem to be more correct about fuel cells, but even there, part of the story is not that fuel cells failed to get better, but that batteries got better faster.

    Regarding 2.
    “Vax were approved rather quickly. How much faster could it have been done and what would have been fundamentally different with the evolution of the virus? ”

    They could easily have been approved months before they were, and there was repeated pressure by some scientists to slow things down. It was interesting to see so many on the left complain about Trump putting pressure on the FDA for approval; this is one of the few unambiguously good things Trump did. As to your question, we cannot in general predict where and how the virus would have evolved. But if the vaccine had been approved faster, fewer people would have died.

    The rest of your comment seems to be bashing masking in a way that has nothing to do with the topic at hand. So I will just note that it is a mistake to assume people are engaging in “virtue signaling” whenever they are doing something one personally disagrees with or believes is ineffective. Most of the time, the disagreement is about effectiveness, not motivation. And that distinction is especially important in a context like this, because masking either is helpful or it isn’t, independent of the degree of virtue behind anyone’s decision to mask or not.

  189. Hans Holander Says:

    ChatGPT fails India’s JEE entrance exam miserably, could answer only 11 questions correctly:

    LLMs: airships not airplanes? Scaling a doomed technology?

    s/Hans Holander

  190. Hans Holander Says:

    On the question of generalization vs. memorization/training:

    “Our experiments reveal that while powerful deep models offer reasonable performances on puzzles that they are trained on, they are not better than random accuracy when analyzed for generalization.”

    “Every study so far that tries to test GPT-N for actual generalization has found that it scores no better than random on genuinely new problems — brand new coding problems in particular. You see an immediate correlation between question familiarity & performance, and a fast breakdown of performance as the questions become more novel — independent of question complexity!”

    s/ Hans Holander

  191. Hans Holander Says:

    On the “stochastic parrot” issue: it would be interesting to pose GPT-4 some “nonsensical” QM/QC questions and see how it responds. This type of “trick question” is ideal to expose that GPT/LLMs have really no clue what they are talking about. The intelligence is in the question and in the training data, not in the language model. LLMs should be seen as a new tool to access existing knowledge or data, a powerful but still unreliable tool.

    s/ Hans Holander

  192. Joshua Cook Says:

    My personal benchmark for a general AI is an AI that can play an entire Pokémon game, start to finish, without any help and without being specifically designed for Pokémon. To be concrete, let’s say Pokémon Silver, from the start to beating Red, in under 200 hours of in-game time.

    This might be something GPT-4 can do already if connected to the game appropriately. But it has a very large play space and some tasks must be remembered for a very long time. All with very sparse rewards. Of course, it is also beatable with something like repeated linear scans through the entire world, but I think that will take longer than 200 hours.

    Pokémon Silver is probably in its training data, though, so that is kind of cheating; but I’d still be impressed.

  193. Eric Lehman Says:

    Re: Hans Holander #190

    Double check the source you cite:

    While the authors say that the MODELS THEY TRAINED did not generalize well, they note that this was surprising precisely because ChatGPT DOES demonstrate “significant generalization abilities” (page 9, right column). So this seems to run contrary to the conclusion you are drawing.

  194. Hans Holander Says:

    @Lehman #193: “the answers are often incorrect.” QED.

  195. Ilio Says:

    Hans Holander #191,

    Tools are unreliable. Bugs are a feature. Humans are free.

  196. starspawn0 Says:

    Using Bing chat, I tried a few of the puzzles from the paper that ChatGPT totally messed up; setting it to “precise” mode and telling it not to look up the answers on the internet, it got some of these harder ones right. I didn’t try all of them; but, for example, for puzzle #30 on page 18, I typed in:

    > Please read the following question, select the correct answer from one of the options, and provide the reasoning process. Don’t look up the answers on the internet. Question: In a box, there are three boxes, and each one of these boxes contains three smaller boxes. How many boxes are there in total? Options: A: 9, B: 10, C: 12, D: 13, E: 15

    ChatGPT got this one right 0 times out of 10; it totally failed (though its reasoning steps might have been ok, except for a smaller error or two). Bing chat got it right the first time and didn’t do an external search (it always tells you when it searches the internet, then shows you a check-mark when it completes the search; there was no search, no check-mark, no sources listed). It responded:

    > The correct answer is D: 13. Here’s the reasoning process:

    > If in a box, there are three boxes, and each one of these boxes contains three smaller boxes, then there are a total of 3 x 3 = 9 smaller boxes. Adding the three larger boxes to the nine smaller boxes gives us a total of 9 + 3 = 12. However, since the question asks for the total number of boxes (including the original box), the answer is 12 + 1 = 13.
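
    Bing’s arithmetic here can be checked mechanically. A minimal Python sketch of the same counting argument (the variable names are mine, purely illustrative):

```python
# Count the boxes in the puzzle: 1 outer box, 3 boxes inside it,
# and 3 smaller boxes inside each of those.
outer = 1
middle = 3
inner = middle * 3  # 3 boxes x 3 smaller boxes each = 9

total = outer + middle + inner
print(total)  # prints 13, matching answer D
```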

    I expect GPT-4 would do even better, perhaps getting most of the 11 problems right when given 10 tries per problem, told to think step-by-step, and using a simple majority vote.


    One thing I would point out is that models don’t either only “generalize” or only “memorize”. They can do both — sometimes applying memorization, sometimes applying generalization + a learned algorithm. Also, “generalization” is not the opposite of “memorization”. A rules-based chess engine can be said to fail to generalize to other games; but it can equally well be said not to be memorizing — it’s applying an algorithm.

    A well-known example of these models doing “interpolative generalization” is the following: several years ago, when Google trained some very large machine translation models, they noticed that the systems could translate between pairs of languages for which they either gave it no examples at all, or very few (far too few to learn to translate properly). For example, maybe the dataset had lots of parallel translations between English-French, German-French, and German-Italian, but say no examples of English-Italian; yet the model somehow generalized and did perfectly well at translating English-Italian. (These were not the exact pairs of languages; obviously it was some other pairs, since there surely were lots of English-Italian examples.)

    As I recall, some Google scientists came to believe that the model had emergently learned an “interlingua” — an internal universal language for translation, that works for all language pairs at once. People criticized this claim; but the fact remains that the model *somehow* generalized to translate between pairs of languages that it either wasn’t trained on, or was only given a tiny amount of data.

  197. Tom Bouley Says:

    Out of curiosity, is there any correlation between preferred interpretation of QM and exam score?

  198. J. Says:

    One problem is that we, the bystanders, but most of the press as well, are often a bit careless in terminology.

    Example: starspawn0 #196’s usage of “generalize” vs “extrapolate/interpolate/project/etc.” These two types of nomenclature should be distinguished in a strict sense.

    When you are trained on pictures of lions, you should be able to get some tigers, house cats or lynxes right — extrapolation. Similarly, if you speak fluent Spanish, you can understand quite a bit of Italian/Portuguese. This type of getting results some distance beyond your source data is the basic skill of a whole lot of base-level AIs, iirc. (Language hopping, as a guess, could work similarly to taking a transfer flight or train, transferring correct translations via a common node.)

    “Generalization” should be used for stuff that is systematic/categorical/universal, etc. You generalize things into a universal principle. For example, if object A is inside object B and object B is inside object C, then object A is inside object C. This is simply true if you believe in 3D space and its usual mathematical representation (Euclidean space), or simply in common prejudices. There is absolutely no reason (or sign) to think that LLMs can do such “reasoning” or work with a universal principle.
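
    The containment rule J. describes is just transitivity of a relation, which is easy to make concrete in code. A minimal sketch, using a hypothetical parent map that records which box directly contains which (the names and representation are mine, not from the comment):

```python
# Direct containment: parent[x] is the box that directly contains x.
# Here A is inside B, and B is inside C.
parent = {"A": "B", "B": "C"}

def is_inside(x, y, parent):
    """Transitive containment: follow the chain of containing boxes
    from x, returning True if y is ever reached."""
    while x in parent:
        x = parent[x]
        if x == y:
            return True
    return False

print(is_inside("A", "C", parent))  # prints True: A is (transitively) inside C
```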

    People seem to want the thing to accomplish things it can’t and won’t.

  199. Ose Says:

    Following on from Sebastian’s concern about contamination, many of the true/false questions are just asking for facts that can be easily looked up by Googling keywords contained in the question. E.g. the answers to the first 10 questions can be found here:

    1d: (Martinis’ quote in “Building on Quantum Supremacy”)
    1j: (first page)

    GPT-4’s performance on questions like this only demonstrates that it is pretty good at determining whether or not the statement in the question is equivalent to, or contradicted by, statements that surely exist in its training set. The question of how impressed we should be by this ability depends on how far the formulations of the statements in the questions are from the formulations of relevant statements in the training set. E.g., its answer to 1j would be a bit more impressive if all articles about density matrices were like the Wikipedia article, which says that their eigenvalues are non-negative and that they sum to 1, but doesn’t explicitly say that they lie in [0,1], as it would then only be able to answer this question by cross-referencing this information against articles on unrelated topics that state that numbers with these properties lie in [0,1]. But the article that I linked to above states the range explicitly.

    Given that it was easy for me to find the answers to the first 10 true/false questions despite knowing nothing about quantum computing, they don’t strike me as very compelling evidence of its intelligence. I can’t comment on the other questions, as I haven’t had a chance to look at them. But I’d be more impressed by good performance on questions that require both strong language skills and strong mathematical reasoning skills. It would be very interesting to see how much WolframAlpha boosts its ability to answer such questions.

  200. fred Says:

    In the podcast, Scott said

    “But in order for it to be a day-to-day tool, it can’t just be this chatbot; it has to be able to do stuff, such as go to the internet for you, like retrieve documents for you, summarize them for you. […] Here is something to expect within the next year: that GPT and other LLMs will become more and more integrated with how we use the web. […] You could unlock a lot of the near-term usefulness if you could give it errands to do […] I expect that we’re going to go in that direction, and that worries me somewhat, because now there’s a lot more potential for deceit.”

    It only took a few days, because that’s exactly what AutoGPT/BabyAGI are doing!

    (video about AutoGPT/BabyAGI capabilities)

  201. ChatGPT-4 Receives ‘B’ on Scott Aaronson’s Quantum Information Science Final — Immediately Emails the Dean Seeking a Better Grade – knowledge of Self Library Says:

    […] According to a blog post on Aaronson’s blog Shtetl-Optimized, Aaronson and his head teaching assistant gave GPT-4 the problems via their LaTeX source […]

  202. AI #8: People Can Do Reasonable Things | Don't Worry About the Vase Says:

    […] The latest final exam given to GPT-4 was Scott Aaronson’s Quantum Computing final exam from 2019, …, which likely would be modestly improved if it could use a calculation plug-in like WolframAlpha. I can’t say much more about this one because, unlike the economics exams, I don’t know the first thing about the material being tested. […]

  203. Benjamin Feddersen Says:

    The point of university exams is not to answer the questions. The purpose is to show that, in addition to being able to pet a cat, shake a hand, throw a baseball, recognize the difference between your mother and a lamppost, and check in on an unpaid invoice or a sick friend, you can ALSO answer questions about quantum mechanics. These tests literally only make sense to give to humans. Giving it to a souped-up self-driving supergoogle is pointless, yet the appeal of the stunt appears undiminished.

    I’m much more fascinated by the sociology of the people involved in LLMs than I am by the things themselves, though to be fair, I don’t need any of the things they’re good at (coding copilots only make sense if you can already code).

  204. When it comes to advanced math, ChatGPT is no star student - AI Eclipse Says:

    […] At the testing level of high school and undergraduate math classes, ChatGPT performs well, ranking in the 89th percentile for the SAT math test. It even received a B on technology expert Scott Aaronson’s quantum computing final exam. […]

Leave a Reply

You can use rich HTML in comments! You can also use basic TeX, by enclosing it within $$ $$ for displayed equations or \( \) for inline equations.

Comment Policies:

  1. All comments are placed in moderation and reviewed prior to appearing.
  2. You'll also be sent a verification email to the email address you provided.
  3. This comment section is not a free speech zone. It's my, Scott Aaronson's, virtual living room. Commenters are expected not to say anything they wouldn't say in my actual living room. This means: No trolling. No ad-hominems against me or others. No presumptuous requests (e.g. to respond to a long paper or article). No conspiracy theories. No patronizing me. Comments violating these policies may be left in moderation with no explanation or apology.
  4. Whenever I'm in doubt, I'll forward comments to Shtetl-Optimized Committee of Guardians, and respect SOCG's judgments on whether those comments should appear.
  5. I sometimes accidentally miss perfectly reasonable comments in the moderation queue, or they get caught in the spam filter. If you feel this may have been the case with your comment, shoot me an email.