There’s been a lot of work on reducing the classical cost, but mostly before April. Is quantum advantage back?

The paper: https://arxiv.org/abs/2304.11119

https://arxiv.org/abs/2309.06979

One of Malach’s theorems, in informal language (page 2):

“Theorem 1 (informal). For any function f that can be efficiently computed using a Turing machine, there exists a dataset D such that training a (linear) auto-regressive next-token predictor on D results in a predictor that approximates f.”

(One of several related posts of mine back in March: https://scottaaronson.blog/?p=7094#comment-1947372 )

I think some commenters on this blog doubted that something like this could be true, reasoning that, yes, LLMs can compute very general things if you get to set the weights precisely, but actually *training* them on data to do this is another matter, and in fact you won’t succeed in every case.

….

I think Malach’s examples are also very interesting: very simple linear neural nets trained on the TinyStories dataset (with a final non-linearity when a token is chosen), and MLPs trained on an arithmetic dataset with chain-of-thought. It makes me wonder whether it’s possible to create a synthetic dataset, with liberal use of chain-of-thought and inner monologues (written in invisible tokens), such that training one of these ridiculously simple models on it yields a language model as good as Transformer-based ones, but using dramatically less computation and maybe even fewer parameters. Maybe a pretty good one would even fit on a smartwatch.
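Here’s a toy sketch of why chain-of-thought data matters here. The construction is my own illustration, not Malach’s verbatim: XOR is famously not linearly separable from the raw inputs, but if the training sequences spell out intermediate NAND tokens, then *every* next-token prediction becomes a linearly separable problem, and a purely linear auto-regressive predictor learns the whole thing.

```python
import numpy as np

def nand(a, b):
    return 1 - (a & b)

def sequence(a, b):
    # Chain-of-thought: XOR(a,b) = NAND(NAND(a,n1), NAND(b,n1)), n1 = NAND(a,b)
    n1 = nand(a, b)
    n2 = nand(a, n1)
    n3 = nand(b, n1)
    return [a, b, n1, n2, n3, nand(n2, n3)]   # last token equals a XOR b

data = [sequence(a, b) for a in (0, 1) for b in (0, 1)]

def features(prefix, width=5):
    # one-hot encode each token of the prefix, zero-padded to a fixed width
    f = np.zeros(2 * width)
    for i, t in enumerate(prefix):
        f[2 * i + t] = 1.0
    return f

# one linear (logistic) next-token model per position, plain gradient descent;
# each target is a NAND of two earlier tokens, hence linearly separable
models = {}
for p in range(2, 6):
    X = np.array([features(s[:p]) for s in data])
    y = np.array([s[p] for s in data], dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(5000):
        w -= X.T @ (1 / (1 + np.exp(-X @ w)) - y) / len(y)
    models[p] = w

def predict_xor(a, b):
    # greedy decoding: feed the generated chain-of-thought tokens back in
    seq = [a, b]
    for p in range(2, 6):
        seq.append(int(features(seq) @ models[p] > 0))
    return seq[-1]

print([predict_xor(a, b) for a in (0, 1) for b in (0, 1)])  # -> [0, 1, 1, 0]
```

Train the same linear model to emit XOR directly in one step and it fails, since no linear separator exists; the CoT tokens are doing all the work, which is the flavor of Malach’s Theorem 1.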

And with the right data you might even make the model smarter than the “teacher” in many ways, e.g. using something like what I described here to @nole a few months back:

https://scottaaronson.blog/?p=7188#comment-1949227

I chose that one because I know it works: it turns out some researchers at Meta very recently did something similar (and added a filtering step, which was also automated):

How’s life treating you in general, outside of computational complexity and deep learning, that is 🙂 You mentioned on a recent post (can’t remember exactly which one) that the kids / family obligations and whatnot have been soaking up a lot of your time. I mean, how do you deal with raising two kids while at the same time being a professor, working at OpenAI, and running the top complexity blog on the internet? Is there a secret 🙂

In the future, all comments with needlessly hostile embedded premises will be deleted without explanation.

**A New Counter Culture: From the Reification of IQ to the AI Apocalypse**

I have been thinking about the culture of AI existential risk (AI Doom) for some time now. I have already written about it as a cult phenomenon, and I am not the only one to do so. But I believe that label is rather thin. While I still believe it to be true, it explains nothing; it’s just a matter of slapping on a label and letting it go at that.

While I am still unable to explain the phenomenon – culture and society are enormously complex: Just what would it take to explain AI Doom culture? – I now believe that “counter culture” is a much more accurate label than “cult.” In using that label I am deliberately evoking the counter culture of the 1960s and 1970s: LSD, Timothy Leary, rock and roll (The Beatles, Jimi Hendrix, The Jefferson Airplane, The Grateful Dead, and on and on), happenings and be-ins, bell bottoms and love beads, the Maharishi, patchouli, communes…all of it, the whole glorious, confused gaggle of humanity. I was an eager observer and fellow traveler. While I did tune in and turn on, I never dropped out. I became a ronin scholar with broad interests in the human mind and culture.

I suppose that “Rationalist” is the closest thing this community has as a name for itself. Where psychedelic experience was at the heart of the old counter culture, Bayesian reasoning seems to be at the heart of this counter culture. Where the old counter culture dreamed of a coming Aquarian Age of peace, love, and happiness, this one fears the destruction of humanity by a super-intelligent AI and seeks to prevent it by figuring out how to align AIs with human values.

I’ll leave Bayesian reasoning to others. I’m interested in AI Doom. But to begin understanding that, we must investigate what a post-structuralist culture critic would call the discourse, or perhaps the ideology, of intelligence. To that end I begin with a look at Adrian Monk, a fictional detective who exemplifies a certain trope through which our culture struggles with extreme brilliance. Then I take up the emergence of intelligence testing in the late 19th century and the reification of intelligence in a number, one’s IQ. In the middle of the 20th century the discourse of intelligence moved to the quest for artificial intelligence. With that we are at last ready to think about AI x-risk (as it is called, “x” for “existential”).

This is going to take a while. Pull up a comfortable chair, turn on a reading light, perhaps get a plate of nachos and some tea, or scotch – heck, maybe roll a joint, it’s legal these days, at least in some places – and read.

https://www.youtube.com/watch?v=02kbWY5mahQ&ab_channel=AlexTurner

Scott, what do you think? Anyway, I think this is the best place to post it.

This is an attempt to generalize and strengthen the results in https://arxiv.org/abs/1905.10074 to general quantum circuits and to prove that BPP = BQP (what follows covers only decision problems with no intermediate measurements).

The paper is about using hash functions to reduce the number of qubits required for Simon’s algorithm, Shor’s algorithm, etc. There is some speculation that this could be used to prove BPP = BQP, if an appropriate homomorphic hash function could be found.

Consider the following setup.

For a quantum circuit with n qubits, g gates, no intermediate measurements, and 1 decision qubit, create a new circuit with a 2-qubit window for applying unitary gates and one $O(\log n)$-qudit register formed by hashing the bases of the remaining qubits with a universal hash function. For each gate in the original circuit, perform an operation that amounts to unhashing the state, rotating the state so as to expose a different 2-qubit window, and rehashing. Then apply the unitary gate to the new 2-qubit window. At the end of the circuit, compute the probability of the decision qubit measuring 0 or 1. The 2-qubit window is chosen for convenience, so that arbitrary 2-qubit gates are universal for quantum computation.

The universal hash function is

$$
h_{a,b}(x) = \left\lfloor \frac{(ax + b) \bmod 2^{w+M}}{2^w} \right\rfloor
$$

The reason for choosing this universal hash function is that it is well adapted to working with qubits (it would be less suited to qutrits) and that “inverting” it is easy. The hash function is applied by treating the qubits’ bases as integers $x$ and taking each basis vector $x$ to its hashed basis $h(x)$.

Rotating to a new qubit window means adding and subtracting powers of 2 to $x$, as in $x \to x + 2^k - 2^l$. Then, when we rotate the 2-qubit window, the result is to shift the hashed state left or right. It’s not an exact permutation, since basis vectors in the same bin can wind up in different, yet neighboring, bins with calculable probabilities $p$ and $q$ such that $p+q=1$. As such, we can approximate the result of unhashing, rotating, and rehashing the state.

Not being a true permutation means that the matrix fails to be unitary. (As a sub-question: what is the probability distribution of the determinant of this matrix?) However, if we coarse-grain the result of $UU^\dagger$ by merging neighboring bins, the result will, with a linear number of mergers, quickly approach unitarity.

Now, the question is whether the circuit described above can approximate quantum circuits in polynomial time. The argument consists of two steps: 1) the error introduced by each layer (that is, an unhash-rotate-rehash step followed by a unitary gate) is bounded, and 2) these errors accumulate over the course of the circuit slowly enough that the circuit can approximate the quantum circuit in polynomial time.

For 1), May and Schlieper provide an argument, for specific circuits, that the error in probability for a universal hash function is $O(1/m)$ (so the probability amplitude error would be expected to be around $O(1/\sqrt{m})$). Their proof requires that certain probability amplitudes sum to 0, which can’t be assumed for general quantum circuits.

The error *introduced* by each layer can be estimated by calculating the variance and standard deviation of the probability and probability-amplitude errors. For $m$ bins, rotations have variances of about $O(1/m)$ and standard deviations of around $O(1/\sqrt{m})$: an $O(\frac{1}{m^2})$ probability error per bin gives an $O(\frac{1}{m})$ probability error overall, and an $O(\frac{1}{m})$ probability amplitude error per bin gives an $O(\frac{1}{\sqrt{m}})$ probability amplitude error overall. So the errors are small enough per layer.

For 2), the main concern is that, since the unhash-rotate-rehash matrices are not unitary, they could magnify already-existing errors (unitary matrices would keep them the same magnitude). The idea is to show that the circuit consists only of quantum operations (unitary gates and measurements), which can’t stretch the errors out, so that errors add up at most linearly. That is, the hashed state is described as a reduced density matrix of the true quantum circuit.

For the hashed circuit’s density matrix, the unhash-rotate-rehash steps are formed from a sum of unhashed rotations, so each should be describable as a quantum operation (one that further mixes the quantum state).

Alternatively, we can think of hashing as changing the basis of the unhashed qubits to one containing the hashed qubits (assuming $m$ is a power of 2) plus the remaining qubits, and then tracing out the “environment,” leaving only the 2-qubit window and the hashed qubits. Then the quantum-operation argument applies, and neither the unhash-rotate-rehash steps nor the 2-qubit unitary gates should be able to stretch out errors, so the error should accumulate at most linearly across gates. (Otherwise, entangling a state with the environment would allow quantum cloning.)

Looking at probability amplitudes: if the errors grow linearly, the running time will be around $O(\frac{g^3}{\epsilon})$, where $\epsilon$ is the allowed error in the decision qubit; and if the probability-amplitude errors grow like a square root, the running time should be around $O(\frac{g^2}{\epsilon})$, which is what we would expect if the randomizations of the layers are independent of each other and variances add up.
