A Vision of Metascience

An Engine of Improvement for the Social Processes of Science

By Michael Nielsen and Kanjun Qiu
October 18, 2022

How does the culture of science change and improve? Many people have identified shortcomings in core social processes of science, such as peer review, how grants are awarded, how people are selected to become scientists, and so on. Yet despite often compelling criticisms, strong barriers inhibit widespread change in such social processes. The result is near stasis, and apathy about the prospects for improvement. People sometimes start new research institutions intended to do things differently; unfortunately such institutions are often changed more by the existing ecosystem than they change it. In this essay we sketch a vision of how the social processes of science may be rapidly improved. In this vision, metascience plays a key role: it deepens our understanding of which social processes best support discovery; that understanding can then help drive change. We introduce the notion of a metascience entrepreneur, a person seeking to achieve a scalable improvement in the social processes of science. We argue that: (1) metascience is an imaginative design practice, exploring an enormous design space for social processes; (2) that exploration aims to find new social processes which unlock latent potential for discovery; (3) decentralized change must be possible, so outsiders with superior ideas can't be blocked by established power centers; (4) ideally, change would align with what is best for science and for humanity, not merely what is fashionable, politically popular, or media-friendly; (5) the net result would be a far more structurally diverse set of environments for doing science; and (6) this would enable crucial types of work difficult or impossible within existing environments. For this vision to succeed metascience must develop and intertwine three elements: an imaginative design practice, an entrepreneurial discipline, and a research field. 
Overall, it is a vision in which metascience is an engine of improvement for the social processes and ultimately the culture of science.

How does the culture of science change?

Imagine you're a science fiction author writing a story depicting a scientific discovery made by an alien species. In your story you show the alien scientists up close – how they work, how they live. Would you show them working within institutions resembling human universities, with the PhD system, grant agencies, academic journals, and so on? Would they use social processes like peer review and citation? Would the alien scientists have interminable arguments, as human scientists do, about the merits of the h-index and impact factor and similar attempts to measure science?

Almost certainly, the design of our human scientific social processes has been too contingent on the accidents of history for the answers to those questions to all be "yes". It seems unlikely that humanity has found the best possible means of allocating scarce scientific resources! We doubt even the most fervent proponents of, say, Harvard or the US National Science Foundation would regard them as a platonic ideal. Nor does it seem likely the h-index and similar measures are universal measures of scientific merit.

This doesn't mean the aliens wouldn't have many scientific facts and methodological ideas in common with humanity – plausibly, for instance, the use of mathematics to describe the universe, or the central role of experiment in improving our understanding. But it also seems likely such aliens will have radically different social processes to support science. What would those social processes be? Could they have developed scientific institutions as superior to ours as modern universities are to the learned medieval monasteries?

The question "how would aliens do science?" is fun to consider, if fanciful. But it's also a good stimulus for immediately human-relevant questions. For instance: suppose you were given a large sum of money – say, a hundred million dollars, or a billion dollars, or even ten or a hundred billion or a trillion dollars – and asked to start a new scientific institution, perhaps a research institute or funder. What would you do with the money?

Would you aim to incrementally improve on the approaches currently taken by Harvard, the NSF, HHMI, and so on? Or would you attempt to create radically new institutions that transcend existing ones and embody a new organizational approach to doing science? Not just in the sense of new scientific ideas or methods in specific fields, but rather new social processes for science; that is, new ways to select and support human beings to make discoveries? Our distant ancestors did not, after all, anticipate the immense improvements possible in humanity's discovery ecosystem1. Perhaps, with sufficient insight, further transformative improvements are possible?

These questions aren't just hypothetical. Lots of concrete work has been done on the question: "what's wrong with the social processes of science, and how can we improve them?" Some of this work is in papers, essays, and manifestos explaining how to fix or improve peer review or funding or hiring or the career structure of science, and so on. Early entries in the genre include such celebrated works as Francis Bacon's The New Atlantis and Novum Organum, and Vannevar Bush's Science – The Endless Frontier. And, of course, there is much modern work: from fields including the economics of science funding, the science of science, science and technology studies, science policy, and others; in the mass media; and on social media and in informal conversation amongst scientists.

Alongside these proposals, many adventurous people are building new and sometimes daringly different scientific organizations. There are big new research institutes, such as the Arc Institute, DeepMind, and Altos. There are tiny pirate insurgencies, such as the Center for Open Science, DynamicLand, EleutherAI, and dozens or perhaps even hundreds more2. There are new funders, such as Convergent Research, Fast Grants, the FTX Foundation, VitaDAO, and many more. Of course, many of these are differentiated in part by their specific research focus: for example, DeepMind was one of the first large organizations focused on artificial intelligence research. But many also have theses based in part on new or unusual approaches to the basic social processes of science. And when we talk to the founders of such organizations they often express hope that not only will their organization succeed, it will be a beacon, succeeding so spectacularly that the underlying social ideas will spread widely, improving humanity's discovery ecosystem as a whole.

How realistic is that hope? Does our discovery ecosystem improve in response to successful experiments with new social processes? Or is it resistant to change, only improving slowly? In a nutshell, this essay explores the question: how well does the discovery ecosystem learn, and can we improve the way it learns? As we address this question, many related questions naturally arise: does the discovery ecosystem enable the rapid trial of a multitude of wildly imaginative social processes? Or are only tiny, incremental changes ever possible? Can outsiders with great ideas displace existing approaches? Or can change only come from people and organizations who already have enormous power?

What we'll find is a discovery ecosystem in a state of near stasis, with strong barriers inhibiting the improvement of key social processes. We believe it's possible to change this situation. In this essay we sketch a vision in which metascience drives rapid improvement in the social processes of science. This vision requires a strong theoretical discipline of metascience, able to obtain results decisive enough to drive the adoption of new social processes, including processes that may displace incumbents. It also requires a strong ecosystem of metascience entrepreneurs, people working to achieve scalable change in the social processes of science. In some sense, the essay explores what it would mean for humanity to do metascience seriously. And it's about placing that endeavor at the core of science. We believe the net result will be a portfolio of social processes far more structurally diverse than today, enabling crucial types of work difficult or impossible within existing environments, and so expanding the range of possible discoveries.

As far as we are aware, these questions have not previously been explored in depth. To make the questions more concrete, let's sketch a few specific examples of unusual social processes that could be (or are being) trialled today by adventurous funders or research organizations3. These sketches are intended as brief illustrative examples, to evoke what we mean by "changed social processes". Though the examples are modest and conservative – indeed, some ideas may be familiar to you4, though perhaps not all – versions scaled out across science would significantly change the culture of science. It's a long list, to emphasize the many diverse opportunities for imaginative change. Later in the essay we develop deeper ways of thinking that generate many more ideas for change.

With these examples in mind, we may restate the basic questions of the essay. Suppose, for instance, that the first of these ideas, "fund-by-variance", was given a serious trial, perhaps with multiple rounds of debugging and improvement. Suppose it was found that when implemented well it was a decisive improvement over the committee-based peer review approach used today by many funders. Would it then be adopted broadly? Or would other institutions ignore or resist it? In a healthy, dynamic discovery ecosystem it would spread widely, displacing inferior methods when appropriate. By contrast, in a static ecosystem, even if the early trials were extremely successful, other institutions would be slow to respond, or resistant to change. They would get hung up over whether the approach came from the "right" originator, or was prestigious enough. In a healthy discovery ecosystem the improved idea could come from anywhere.

In early drafts of this essay we were hesitant about writing the concrete list of ideas above. We worried that it would anchor readers on "these are the changes Nielsen and Qiu are arguing for". But the individual programs are not the point; indeed we'll suggest many more (and sometimes deeper) ideas later in this essay. Rather, the point is that a flourishing ecosystem would rapidly generate and seriously trial an enormous profusion of ideas, including many ideas far more imaginative than anything listed above. The best of those ideas would be rigorously tested, iterated on, debugged, and scaled out to improve the entire discovery ecosystem. Indeed, if truly bold ideas were being trialled, then they would include many ideas we would at first disapprove of, but sometimes the evidence for them would be so strong that we'd be forced to change our minds.

As stated above, the focus of this essay is how the discovery ecosystem improves. Part of the motivation for this focus is the belief that the design space for promising new social processes is vast.

We won't prove this belief. But we'll try to make it plausible. In Part 1 of the essay we'll discuss heuristics for exploring this metascience design space. These heuristics arise out of plausible models of how human beings make discoveries. Indeed, all the social processes of science reflect and are grounded in such models – often implicit or informal folk theories – of how discovery happens. Weak metascientific ideas result in weak social processes; stronger metascientific ideas result in stronger processes. Insofar as we can improve our metascientific theories, we can improve the way human beings make discoveries. A good way to develop that understanding is to explore boldly in the design space above, understanding which ideas do and don't work, and why.

Throughout this introduction we've used many different-but-related terms, talking about changing the social processes of science; changing the culture of science; changing the institutions of science; changing the discovery ecosystem; and so on. From now on we'll use "social processes of science" as an informal catchall. By that we mean the institutional practices, incentives, norms (and so on) widely used in science. So when we talk about change to the social processes of science we are talking about changes to things like peer review, or hiring practices, or how funders approach risk, as well as broader ideas, such as the "pull immigration" or "printing press for funders" or "anti-portfolio" ideas mentioned above. The phrase "social processes of science" is unfortunately unwieldy. But it is nonetheless very helpful to have such a catchall. We will save the use of other terms for when we have a more specific need.

As noted earlier, the essay is an entry in the field of metascience. This still emerging field overlaps and draws upon many well-established fields, including the philosophy, history, and sociology of science, as well as newer fields such as the economics of science funding, the science of science, science policy, and others. While we draw on all these fields, there are notable differences in our focus. Unlike the philosophy of science, we are more concerned with social processes than methodology. The two are intertwined, so this is a difference of degree, not kind, but is nonetheless real. Of course, fields such as the sociology of science and the science of science do focus on social processes, but the focus is primarily descriptive, not imaginative design and active intervention, as discussed in our work. The exception is science policy, which has design and intervention as core goals. However, in the science policy world interventions are often focused on what is practical within existing power structures. We shall be concerned with more a priori questions of principle, and enabling decentralized change, i.e., change that may occur outside existing power structures. For all these reasons we think of the essay as simply part of metascience.

One topic not discussed in the main body of the essay is the relationship between metascience and several external factors affecting the future of science (artificial intelligence, the rise of China and India, the colonization of space, and intelligence augmentation). We discuss these briefly in an Appendix to the essay. The focus in the body of the essay is more endogenous to science.

Underlying the essay is the assumption that improved social processes can radically transform and improve science. Most scientists we've spoken with agree with at least a weak form of this assumption. For instance, many strongly advocate metascientific principles like: the importance of freedom of inquiry for scientists; or that it strengthens science if outsiders can overturn established theories on the strength of evidence, not their credentials. For science to work well, such ideas must be expressed, even if imperfectly, in the design of the social processes of science. As said above, the quality of our social processes (and of our institutions) is determined by the quality of the metascientific ideas they embody.

But our assumption in this essay is much stronger than that weak form. Again: we believe improved social processes can radically transform and improve science. It's not so obvious this is true. Some scientists we talk with are excited by this idea, and agree that new social processes may be transformative. Others sharply disagree, telling us that broad, ecosystem-level social changes make little substantive difference to the actual science. Indeed, several notable scientists have expressed this to us using minor variations on a single phrase: "all that matters is to fund good people doing good work". Several others have said words to the effect of: "I admire your optimism, but the system never really gets any better, there's just more and more bureaucracy and 'accountability'". It's possible those people are correct. But the only way to determine that is to actively explore the question: what if there are truly transformative social processes waiting to be discovered?

We'll proceed on that assumption for now, and return at the end of the essay to reconsider whether it is true. For now, let us merely recount an anecdote from the legal scholar and computer scientist Nick Szabo22. Szabo writes of how, during the early Renaissance, exploring the oceans was an extremely risky business. Ships could run aground, or be blown badly off course by storms. Sometimes entire crews and ships and cargoes were lost. There were risks at all levels of an expedition, from the health and livelihood of individual sailors through to financiers who faced ruin if the ship ran aground or was badly damaged. But, Szabo points out, this risk profile changed considerably in the 14th century, when Genovese merchants invented maritime insurance: for the cost of a modest premium, the people financing the expedition would not suffer if the ship was damaged. This spread the risk, and made the expedition much less risky for some (though not all23) participants. That change in the funding system helped enable a new age of exploration, discovery, and prosperity.

It is easy to imagine a salty Genovese sea captain, upon being asked how to improve shipping, saying that you "just need good ships, crewed by good sailors". This would be in the vein of our scientist friends telling us "just fund good people doing good work". It contains a large grain of truth, but is not incompatible with system-level ideas radically improving the situation. The salty scientists are correct, but only within a limited outlook. Research organizations do need to be maniacal about funding good people with good projects; they can also make system-level changes which have much more profound effects. This essay is about such system-level changes.

Put another way: we believe our salty scientists are blind to just how strongly systems and social processes shape creative work. This isn't because they're unimaginative. Often, it's because the systems change they've seen during their own careers is mostly bureaucracies making themselves happier (and often everyone else unhappier), with more red tape and demands for accountability. Naturally, such scientists are cynical about the prospects for improvement. We hope to make a convincing case that much more unconventional changes are possible, resulting in a profoundly different and better discovery system.

A variation of this line of critique is that the current social processes of science already produce many terrific outcomes. This is certainly true! It's a humbling experience to talk to the best scientists: what they can do is genuinely astounding. And you think to yourself: "this is it, these are extraordinary human beings, being used by humanity to near their full capacity". It's a wonderful achievement of humanity that we do support such people. And it's reasonable to ask why we'd need anything else. Why not just scale this up? Indeed, sometimes when talking with such people we encounter friendly or skeptical bewilderment. For them and their friends, the current system works well, and they don't see the need for anything different. But perhaps there are very different types of scientist (and scientific work) who could also achieve astounding things in science, perhaps achievements which the current system is unknowingly bottlenecked on, because their personality type doesn't thrive within that system? And perhaps they, and their approach to science, would thrive if there was more structural diversity in the social processes of science? This is a central point to which we shall return.

At first glance, this essay may appear to be an entry in that flourishing genre, what's-wrong-with-science-and-how-to-fix-it? This genre is well represented on social media, in conversation among scientists, and in articles in the scientific and mainstream media. "Here are the things wrong with peer review [or grant agencies, or universities, or etc…] – and how to fix them." Indeed, the genre is not new: you can find discussions of these issues going back decades and even centuries. Each generation confronts the problems anew, and proposes solutions anew. But although there is no shortage of grand hopes and plans, progress is often slow.

Our point of view is different in a crucial way: we are not proposing a single silver bullet. We believe the opportunity is far larger. What we want is a flourishing ecosystem of people with wildly imaginative and insightful ideas for new social processes; and for those ideas to be tested and the best ideas scaled out. We will show, in part by example, that there are many different possible approaches to fixing peer review [or grant agencies, universities,…]. Instead of believing we already know the answers, and just need to implement them, it's better to develop a discovery ecosystem which can rapidly improve its own processes. The fundamental underlying questions are: How do social processes in science change? Is there a general theory of such change? Is it possible to speed up and improve that change? This subject isn't fashionable in the same way as yet another proposal for "how to fix grant agencies" or "how to fix universities". But we believe it is a fundamental issue in the way human beings make discoveries, a central problem of metascience, and at the very core of science24.

As a final note, we've used informal language throughout the essay, and that may make some readers mistake it for journalism. But while it is in part a synthesis, that isn't the primary intent. Rather, it's intended as a creative research contribution: a broad vision of the purpose and potential of metascience, and how it can change science. We introduce terminology and simple models for many key elements of metascience, and sketch many core problems. Our arguments unavoidably sometimes use speculative and incomplete reasoning. We shall borrow from many existing fields, but our work isn't primarily intended as a contribution to those fields. Rather, it's intended as a sketch of part of the emerging proto-field of metascience, to help it along the way to becoming a fully fledged field25.

Part 1: Exploring the metascience design space

One conception of metascience is that it's about fine-tuning science, making incremental tweaks to social processes such as peer review or grant-making. But we conceive of metascience differently, believing radically different and far better social processes are possible, and that the metascience design space is vast and mostly unexplored.

Indeed, we believe the design space is so large that exploring it will require decades or centuries, at the least. Still, in Part 1 we hope to evoke something of that grand size. We'll explore the space, focusing initially on simple program ideas that could be trialled unilaterally by some imaginative funder. Although this initial focus is restricted, it can be used to illustrate generative design heuristics that help explore in imaginative ways. Later we'll broaden our scope. Along the way we'll sometimes run across well-known ideas – things like the currently fashionable idea of funding lotteries, or the idea that one should fund people, not projects. But to keep the discussion fresh we'll also mention ideas that have only rarely been discussed, or which to our knowledge are novel.

The funder as detector-and-predictor: one heuristic for exploring the metascience design space

Let's begin with a simple heuristic for exploring the design space. We call this the funder-as-detector-and-predictor model26, or just the detector-and-predictor model, for short. As the name implies, this is a two-part model. In one part of the model, we think of a science funder as a kind of detector or sensor27, a collective human instrument aiming to locate intellectual dark matter. That is, it aims to locate important ideas or signals present in the discovery ecosystem, but ignored by existing funders. For example, the Century Grant Program aims to elicit a class of previously invisible intellectual dark matter – ideas for projects that should last a century or more. There may be many great ideas for such projects out there; there may be few such ideas. We won't know unless we do a determined search! In this part of the model, crucial questions to ask include: what types of important signal are present in the system, but are currently ignored? Is there information which is systematically being hidden; and, if so, how might it be elicited? And what new mechanisms can we develop to locate and amplify signal28? Concretely: what does the body of scientists know that is important, and yet is currently either invisible (or not visible enough) to funders? And how can we surface that information?

In the second part of the detector-and-predictor model, funders are thought of as predictors, trying to predict future outcomes. In particular, they use some inference process to make decisions about an uncertain future, on the basis of incomplete current information. (This is the underlying problem to be solved in funding discovery.) As an example, the idea of high-variance funding is based on a simple change to the inference method used to make decisions: instead of using typical or average scores, use the variance in scores to help decide which proposals to fund. In this part of the detector-and-predictor model, crucial questions include: what information might we collect? What hedging and aggregation and indirection strategies might be used? Where is there asymmetric opportunity? Or opportunity for unique marginal impact? What are the possible contractual designs? Where is the risk, and how can it be moved and transformed?
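To make the contrast between the two inference methods concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how any funder actually implements high-variance funding: the function name `rank_proposals`, the proposal names, and the reviewer scores are all hypothetical, and using the population variance as the sole ranking criterion is an assumption made for simplicity.

```python
from statistics import mean, pvariance


def rank_proposals(scores, key="variance"):
    """Rank proposals from highest to lowest by a summary of reviewer scores.

    scores: dict mapping a proposal name to a list of reviewer scores.
    key:    "mean" ranks by average score (the conventional approach);
            "variance" ranks by disagreement among reviewers.
    """
    if key not in ("mean", "variance"):
        raise ValueError("key must be 'mean' or 'variance'")
    stat = pvariance if key == "variance" else mean
    return sorted(scores, key=lambda name: stat(scores[name]), reverse=True)


# Hypothetical reviewer scores on a 1-10 scale (illustrative data only).
scores = {
    "safe_incremental": [7, 7, 7, 7],
    "polarizing_bold": [10, 2, 9, 3],
    "uniformly_weak": [3, 3, 4, 3],
}

by_mean = rank_proposals(scores, key="mean")
by_variance = rank_proposals(scores, key="variance")

print("Ranked by mean score:    ", by_mean)
print("Ranked by score variance:", by_variance)
```

In this toy data the proposal that splits reviewers ranks second by mean score but first by variance. Ranking on variance alone also lifts the uniformly weak proposal above the safe one, which suggests why the text says variance would "help decide" rather than decide outright: a real program would presumably combine it with other signals.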

Many of the suggestions we made in the introduction may be understood through the lens of the detector-and-predictor model. We already mentioned the Century Grant Program and high-variance funding, but many others may also be viewed this way. For instance: failure audits are about observing the outcomes from the inference model, in order to determine whether it's achieving some desired end, and using incentives to change the model used. Or: the pull immigration program is about surfacing previously invisible intellectual dark matter. Many of the programs involve both parts of the model: high-variance funding changes, as already mentioned, the prediction method, but will also likely elicit different types of grant proposal, encouraging people with riskier ideas to apply. In this sense, what you detect and how you predict are interwoven. More broadly: it's stimulating to simply look through the earlier examples, and see how the model applies (or not).

The detector-and-predictor model is not intended to be universally applicable, nor to be literally correct as a descriptive model. Rather, it's a generative design heuristic, to help generate plausible, interesting program ideas. By playing with the model you can easily generate an endless supply of potential programs. To illustrate, let's describe four more programs motivated by a view of the funder as a detector searching out intellectual dark matter.

As with the program ideas suggested in the opening section, we're not claiming that any one of these program ideas would revolutionize science. Indeed, we're not even claiming any particular one would work well at all; some might work quite poorly (though we're not sure which!). A healthy discovery system should trial a profusion of ideas, including many which fail; that's what it means to be trying risky things. We do believe it's worth trialling all the ideas above, and many more. Conducting such trials would help answer an immense variety of questions, things like: how much demand is there for discipline switching? What are the resulting flows of scientists between disciplines? What determines those flows? How well do young people perform as Principal Investigators? Are there systematic differences in the directions they explore? And so on, a cornucopia of questions, partial answers, and useful data. In that sense, even "failed" programs would be successful: they will contribute crucial knowledge to our understanding of metascience. And in the event that one of the programs works strikingly well, it can be scaled up. It may even begin to change the culture of science.

There is a particular sense in which existing funders already change their "detector": they actively search for new research subfields to fund. Consider, for instance, the way the NIH systematically expands its panel areas. Or the way DARPA searches for technological whitespace. But the notion of intellectual dark matter goes much further than that35. The unifying motivating question is: what does the body of scientists know that is important, and yet is currently either invisible (or not visible enough) to funders? For example, FROs are not principally about expanding the range of fields being considered; rather, they are about a change in the structure of scientific problem which may be attacked at all. How do you know such intellectual dark matter exists? You can't. But the success of bespoke prior projects such as LIGO, the LHC, and the Human Genome Project at least suggests it's worth looking. Similarly, the Discipline-switching Fellowship isn't about expanding the range of fields considered, but rather about making use of scientists' knowledge of their own comparative advantage. And again, this is suggested by famous examples: Francis Crick from physics to molecular biology; Ed Witten from mathematics to physics. And so on. If you talk with individual scientists, and appreciate the barriers preventing such switching, you realize the intellectual dark matter exists, and a scalable Discipline-switching Fellowship is natural. Such intellectual dark matter is replete in the history and current practice of science. By searching out specific examples, it's possible to identify many more programs in the vein of those above36.

The programs just described were generated by thinking of funders as detectors. What if we focus instead on funders as predictors, trying to develop new inference procedures to make decisions? Again, there are an immense number of ways this can be done. Here are three program suggestions based on changed prediction methods:

We could explore the generative power of the detector-and-predictor model at much greater length. Each element can be riffed upon endlessly, generating many more program ideas. But that's not our purpose in this essay. Rather, the model is here as an example of a generative design heuristic which can be used to explore one (tiny!) part of the metascience design space. Let's now discuss more briefly several other such design heuristics. Each provides a different way of exploring that design space, while illustrating the broader idea of metascience as an imaginative design practice.

Metascience as an imaginative design practice

The detector-and-predictor model poses the question: where is there intellectual dark matter, and how to detect and amplify it? The underlying thesis is that there are many such types of dark matter, and by identifying new types we can activate untapped latent potential for discovery. Now, let's briefly discuss four other metascientific questions motivating different ways of exploring the design space. Each question needs an essay or book of its own, and could generate dozens or hundreds of program ideas, each with its own in-depth treatment40. But we believe these brief descriptions evoke what is possible, and help convey different theses about where there is latent potential for discovery.

There are many other plausible questions one might ask, other generative heuristics to explore the design space. Even more briefly: (1) In biology evolutionary innovation often follows catastrophe; this happens in markets too; would it lead to an explosion of discovery if we temporarily but drastically decreased (and then increased) funding to entities such as the NSF? We don't expect this suggestion to be popular, but that does not mean it's wrong. (2) Can we use cryptoeconomics to radically improve the political economy of science, creating far stronger alignment between individual incentives and collective social good48? (3) Can we diversify exploration and unleash creativity by decreasing the amount of grant overhead, since overhead incentivizes universities to follow grant agency fashion, and is thus a strong centralizing force49?

These questions are different to "where is there intellectual dark matter, and how to detect and amplify it?" But each expresses a different design heuristic, helping us explore different parts of the metascience design space50. As we noted earlier, the value of those heuristics lies not in their descriptive correctness or universality51, but rather in their ability to help generate good new design ideas. Each is based upon a plausible broad thesis about where there is latent potential for discovery, and sketches mechanisms to unlock that potential. The greater the potential and the better it is activated, the more transformative.

Underlying these heuristics is a view of metascience as an imaginative design practice. It's a view very different from that common in the natural sciences, which are most often about more deeply understanding existing systems, or natural variations thereof52,53. Design, by contrast, is about inventing fundamental new types of object and action, which don't obviously occur in nature. Consider Genoese maritime insurance as an example. It wasn't a change to how ships were built, or sailors trained. Rather, it introduced a radical new interlinked set of abstractions – the insurance premium, counterparties, spreading of risk, insurance payouts. These are beautiful, non-obvious ideas, none of which naturally occur in the world. Rather, they were invented through deep design imagination. And despite being "made up", they transformed humanity's relationship to the world. This is characteristic of design imagination. It must have seemed to the Genoese that financiers were "naturally" reluctant to finance expeditions, given the risks, and that this was a fixed feature of the world. And yet design showed that this was an illusion, which could be radically changed.

We mention this because discussions of metascience often give short shrift to imaginative design. We often meet people who think metascience means studying relatively minor tweaks to existing social processes, for things like peer review, hiring, granting, and so on. Those minor tweaks are genuinely valuable, and can teach us much. But we believe that imaginative design can be at the core of metascience. That means inventing fundamental new primitives for the social processes of science. It means developing tremendous design imagination and insight and new ideas to explore the metascience design space. We believe the most important and powerful social primitives in this design space are yet to be discovered.

It's challenging to illustrate this idea of imaginative design as well as we should like. That's in part because truly imaginative design is hard. The individual program ideas we've described illustrate no more than rather modest levels of imagination. A few – ideas like the failure audits or tenure insurance or the Century Grant Program – do involve moderately striking new design ideas (not all ours!). But they're not as imaginative as maritime insurance was in its time. Still, we hope you'll think of individual program ideas like these as points in a pointillist sketch to evoke metascience as a design practice. And we're confident that deeper and more imaginative ideas are possible than anything we gesture at here. Developing such ideas is part of the challenge and opportunity of metascience.

There's no reason to expect scientists to be good at this sort of design. Scientists are users of the discovery system, not (for the most part) designers. There's no reason they should deeply understand how to improve it, any more than someone who drives a car should understand how to design and build a great car. A good driver will notice problems with their car, and may have important insights about cars. But that doesn't mean they'll understand the origins of those problems in the design, or how to fix them, or how to design new and better cars. Just because someone is good at science doesn't mean they have the skills of a good designer. Worse, they're sometimes convinced they know, and will ignore or hold in low regard people who actually have more insight about these things. It's all "soft skills", not "real knowledge of science", in this view, and so how could an outsider have anything useful to say? Contrariwise, just because someone has strong convictions about the social processes of science, doesn't mean they actually have much insight. Indeed, we confess to much self-doubt on this point. On this issue, humanity is still figuring out how to tell the difference between people with insight and people who merely have strong convictions about how science should be54.

We've been developing the idea of metascience as an imaginative design practice. In Part 2 we'll argue that this is one of three major components of metascience. The other two components are: (1) metascience as an entrepreneurial discipline, actually trialling and then scaling out new social processes; and (2) metascience as a research field, aimed at deepening our understanding of the social processes of science, in part as a tool to evaluate their impact on discovery. All three components must work together for metascience to be successful.

The point about scaling out deserves amplification. Our discussions so far have been of modest pilot trials that could be done unilaterally today. If such ideas were scaled out broadly, they would change science far more profoundly than the trials: they would change the culture of science, i.e., the ambient environment and background working assumptions scientists take for granted.

As a concrete example, suppose programs to support much higher risk work were scaled out broadly. Done well, the culture would change so that it was routine for scientists to pursue extremely ambitious and risky projects. Taken far enough, some scientists may begin to worry about their plans not being risky enough, rather than too risky55. This change in culture would have many follow-on effects: in how scientists choose what to work on; in how they develop and change over the course of their working lives; in the personality characteristics of the people who choose to go into and remain in science; and so on. We believe it's not an exaggeration to say that such a change would transform the culture of science. And it would be a qualitatively different kind of change than merely a trial program.

That's just one example. Many others could be given. But we trust the broader points are clear: (1) large cultural shifts are different in kind from unilateral pilot trials, even if the cultural shift is "merely" making widespread some idea of the trial; and (2) each of the trial ideas above has natural extensions as part of broader cultural changes.

Summarizing, and looking ahead, in the vision that will emerge, metascience is not just about the study of science, understanding descriptively what is happening. It has as a fundamental goal interventions to change science. And metascience is not only about incremental interventions. It is also about wildly imaginative design, conjuring new fundamental elements for social processes in science. This is what we mean when we say metascience is an imaginative design discipline. Furthermore, metascience is not just theoretical. That is, it is not just about new understanding (and papers), in the conventional academic mode. It requires building new organizations, new programs, new tools, and new systems. Only by such entrepreneurial building is it possible to test metascientific ideas, and to improve our understanding. That improvement is both valuable in its own right, and also improves what people can build in the future. At the same time, it is not sufficient just to build local systems. Metascience also requires moving from (comparatively) small trials to broader cultural changes in science. That is the main subject of Part 2 of the essay.

Part 2: The decentralized improvement of the social processes of science

Bottlenecks inhibiting decentralized improvement

Imagine you're a graduate student with a brilliant idea to improve the way science is funded. Chafing under what you perceive as a flawed academic system, you learn about the history of science and alternate funding models. You talk with many scientists, and delve into alternate models of the allocation of resources – ranging over fields such as finance, organizational psychology, and anthropology. You develop and discard many ideas; over time, your ideas get more imaginative and insightful. And you gradually develop insights you believe would enable you to create a new funding agency, one that would, given time and resources, be vastly superior to the status quo. You raise pilot funding, and begin to operate. You identify some misconceptions in your ideas, and improve them still further. Suppose you did all this, and your idea genuinely was vastly better for science than existing funders, such as the NSF and NIH. Would your funder rapidly grow to be larger and more influential than the NSF and NIH? This has not happened in the modern era. Nor has it happened that such an outsider has rapidly grown a research organization to a scale outstripping Harvard and Cambridge and other incumbents. Institutions like Janelia and Altos might seem superficially to be examples, but they are not: they didn't grow because they were better; rather, they were simply endowed in advance by wealthy donors. Indeed, the possibility of such growth happening seems almost ludicrous. Garage band research organizations don't grow to worldwide pre-eminence56. But we'll argue that this kind of change is both highly desirable and potentially feasible in science.

One tremendous strength of modern science is that a similar phenomenon does happen routinely with scientific ideas: outsiders or people with little power (e.g., graduate students) replacing established ideas with better ones. There are many famous examples. Think of the way the young and unheralded Francis Crick, James Watson, and Rosalind Franklin triumphed over Linus Pauling in the race to decode the structure of DNA. Pauling was the most famous chemist then alive, and he announced his (incorrect!) structure for DNA first; Crick, Watson, and Franklin were scrappy outsiders, apparently beaten to the punch. But they were right, and Pauling was wrong, and their structure57 was almost immediately accepted by the scientific community, including Pauling(!) Or think of the 22 year old Brian Josephson, whose work on superconductivity was publicly rebutted in a paper by John Bardeen, the only person ever to win two Nobel Prizes in physics. But Josephson was right and Bardeen was wrong, and the physics community quickly sided with Josephson. Or perhaps the most famous example of all: the 26 year old Albert Einstein, a patent clerk who proposed new conceptions of space, time, mass, and energy. And in almost no time his ideas triumphed over the old.

These examples are exalted. But something similar happens routinely at a much less exalted level. One of the best things a graduate student can do in their PhD is convincingly show that a celebrated idea of one of their elders is incomplete (often a gentle way of saying "wrong") or needs extension. That's the way careers are made, and the ideas of science are updated and improved. While this process is often bumpy, this replacement or improvement of scientific ideas is genuinely routine. It's foundational for science, and happens so often that it's tempting to take for granted. But it only happens because extraordinary institutions ensure that good ideas from relative outsiders can get a fair hearing, even when they contradict established wisdom58. The value of such decentralized change in ideas has been understood since at least the time of Francis Bacon, who argued for the primacy of experiment and against the received authority of Church and State. And it was baked into the Royal Society's motto, nullius in verba, take no-one's word for it, chosen in 1660, and still used today. Of course, our scientific institutions don't always achieve this ideal of supporting decentralized change in ideas! There are many cases – think of Lynn Margulis or Gregor Mendel or Alfred Wegener – where the establishment resisted new ideas long after they should have been taken seriously. But nonetheless our scientific institutions do remarkably well, protecting and testing novel ideas, and amplifying genuine improvements, even when they come from outsiders.

What about the analogous process for updating the social processes of science? We began this section with the concrete example of a graduate student attempting to displace today's funders or universities. But the idea can also be interpreted more broadly, to include widespread changes to social processes such as peer review, funding, hiring, and so on, such as the alternatives discussed in Part 1. In an ideal world there would be a means by which many ideas for new social processes could be easily trialled, and then taken quickly through the following metascience learning loop59:

Many of the steps in this diagram could be accomplished today by daring and imaginative funders. But some of the steps are very difficult to accomplish today. In particular, suppose you trial some new social process, and find it greatly superior to existing processes. How would you scale it? In this section, we argue that there are many strong forces inhibiting scaling, enough so that many existing social processes in science are in a state of near stasis, almost unable to change. More broadly, in Part 2 we'll discuss whether and how it is possible to move out of this stasis, so the social processes of science can be much more rapidly improved.

When we criticize (say) funders for being near stasis, we're often immediately told: "that's not true, funders try new processes all the time! Just look at all the interest in funding lotteries!" Funding lotteries are the idea that instead of deciding grant outcomes through a peer review process, funders should fund grants at random from among the pool of applicants (usually after filtering out obviously kooky ideas). The hope is that this will increase the diversity of project ideas submitted. The idea seems to have first been seriously suggested by Daniel Greenberg in 199860. It's been developed over the quarter century since, and began to receive small-scale trials in the latter part of the 2010s. Now, in the early 2020s, funding lotteries are a fashionable topic of investigation. And it's plausible that over the next decade or so they will be widely deployed, although we doubt they'll become the dominant mode of funding.
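The mechanism of a funding lottery, as described above, is simple enough to sketch in a few lines. What follows is a minimal illustration under stated assumptions, not a description of how any actual funder runs its trials: the `Proposal` type, the `passes_screen` flag (standing in for the initial filter on obviously unworkable ideas), and the `run_lottery` function are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    applicant: str
    # Hypothetical flag: False only if a light initial screen rejected the idea.
    passes_screen: bool


def run_lottery(proposals, n_awards, seed=None):
    """Draw winners uniformly at random from the proposals that pass the screen."""
    rng = random.Random(seed)  # a seed makes the draw reproducible and auditable
    eligible = [p for p in proposals if p.passes_screen]
    # Never try to award more grants than there are eligible proposals.
    k = min(n_awards, len(eligible))
    return rng.sample(eligible, k)  # sampling without replacement


# Illustrative pool: one proposal is screened out, the rest enter the draw.
pool = [
    Proposal("A", True),
    Proposal("B", False),
    Proposal("C", True),
    Proposal("D", True),
]
winners = run_lottery(pool, n_awards=2, seed=0)
print([w.applicant for w in winners])
```

The point of the sketch is that, once the screen is applied, no ranking of proposals enters the decision at all: every eligible proposal has the same chance of being funded.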

Funding lotteries are genuinely interesting, and we're pleased to see serious trials. But we're unimpressed by them as a rebuttal to the charge of stasis. They're rather the exception that proves the rule. It's hardly dynamism when it takes a quarter century to get widespread interest in and serious trials of one idea! There should have been (at least) a hundred ideas equally or more ambitious trialled over the same period. A flood, not a trickle. Most of those ideas would have failed, or been qualified successes. And, ideally, a few would have succeeded so decisively that they'd now be deployed at scale. Funding today ought to look vastly different than a quarter century ago. Not just different in the sense of making the administering bureaucracies happier – more process, more red tape, more requirements for "accountability". No: vastly scientifically better, enabling an explosion of discovery. When The New York Times ran a laudatory 2020 article about encouraging results from small-scale trials of funding lotteries, they briefly noted the response of the NSF and NIH: "The U.S. National Science Foundation and the National Institutes of Health say that they have not tested lotteries and don't currently plan to do so." Improving their approach isn't a top institutional priority and matter of urgency for those agencies. It's something to do only if it doesn't interfere with other priorities.

We believe there are four main reasons scalable change to social processes is hard in science. First is the centralization of control over science in a small number of large funders and influential research institutions61. If you ask scientists to tell you how granting, hiring, and peer review are broken, many will tell you about a plethora of problems, and make suggestions for improvement. Unfortunately, many of the suggestions are of the form: "The NIH [or NSF or Nature or Harvard or one of a small handful of other organizations] should [do such-and-such]". This may work if the Director of the NIH or the NSF throws their weight behind the proposal. But that happens rarely. When most resources are under the control of a few organizations which are not designed to undergo radical organizational change, those organizations are a bottleneck preventing improvement.

Second, in many cases there is no single organization or person who can make a change. Again, you hear: "The system needs to change the incentives [or norms or processes] in [such-and-such a way]". Even when true, no-one is singly responsible for processes such as peer review, or for the importance of high impact journals, and so on. You can't get a meeting with the Director of Science-as-a-Whole and convince them to throw their weight behind a change. Such changes are, instead, collective action problems. This does not mean single individuals can't have a big impact: if the Director of the NIH went on the warpath against impact factor, for instance, they could make a big difference. But it's still a community-held norm, and requires collective change. Of course, you may complain privately that the incentives are not right, and "something should be done"62. But while that may blow off steam, sensible working scientists mostly just get on with their scientific work.

The third factor, which reinforces the first two, is network effects that homogenize social processes. Over and over we hear variations on: "I'd like to try this new thing – publishing in an unconventional way, supporting students in high risk or unfashionable work, changing to an unfashionable field – but I have a responsibility to my students and collaborators to toe the line right now." There's a tyranny of the community: people won't try unusual things, since their community would look askance; the unusual things therefore never get serious attention; as a result, the community thinks poorly of those very possibilities. Such communities have mechanisms designed to help them collectively change their scientific ideas, but no similar mechanism for changing their social processes. This is compounded by a shadow of the future: people worrying about imagined future judgments of their community. Suppose, for example, someone wishes to experiment with sharing their scientific results in a non-standard way: they must weigh that desire against the (imagined) future negative judgment of some hiring or granting committee. This may seem a small thing, but community judgment is so important in science that it strongly inhibits experimentation.

These three factors badly bottleneck change within existing institutions. The obvious workaround is to build new institutions which by fiat may ignore the first two factors. Such new institutions could, for example, simply declare that publication in high-impact journals is forbidden to employees, or engage in practices strongly encouraging high risk work. But when this is done, the third factor – network effects – influences the startup institution even more strongly. Scientists considering working for Jazzy Not-for-Profit (or for-Profit) Startup Institute must ponder: do they really want to give up publication in high impact journals? Or to work on risky projects that may not pan out? Or to do anything else which violates the norms of their scientific community? If they ever decide to leave their Jazzy Startup Institute job, won't they then have a tough time finding another good job? After all, other potential employers haven't changed their standards, just because Jazzy Startup Institute has. The shadow of the future looms strongly for such ventures, causing a kind of regression to the institutional mean. One of us (MN) has worked in many unusual startup research organizations. The perennial question within such organizations is: if I hew to local aspirations, will that damage my chances of getting a job anywhere else?

Now, such forces could be overcome if Jazzy Startup Institute seemed likely to grow to be far larger than others, so its standards became dominant and replaced those of the existing community. But there's a fourth bottleneck, which is that there's no natural feedback loop driving growth in new institutions. In particular, even if a new institution is scientifically outstanding, that does not mean it will grow to be much larger than existing institutions. Between these four bottlenecking factors, the discovery ecosystem can only change many of its social processes very slowly.

With that said, these bottlenecks only apply to change in certain kinds of social process: those which are collectively held. It is often possible to change social processes when they're not controlled by central agencies, or network consensus, or strongly influenced by community judgment or the shadow of the future. And when we look empirically, we see lots of variation in labs and institutions in just these ways. We've seen labs with very unusual approaches to mentorship, for example, or to hosting visitors, or to seminar culture63, and so on. Such variations in social process fall outside the bottlenecking forces mentioned above, and so can be changed unilaterally. Such changes are of great interest, both from the practical point of view of doing better science, and as objects of study within metascience. But they're not within the scope of this essay. In this sense, the essay is about a vision of a particular aspect of metascience, that aspect concerned with improving social processes that are collectively held, and so subject to these bottlenecks. For the most part, we'll omit the "collectively held" from "collectively held social processes" through the rest of the essay, though it should be understood.

We've argued that stasis affects collectively held social processes in science. This stasis may be illustrated in many ways, and we'll now briefly mention a few. These examples are not meant as dispositive evidence, but merely as plausible illustrations.

One such example is provided by the Shanghai Rankings, the longest-running ranking of the world's research universities. In the 19 years since being founded in 2003, the Top 10 Universities have merely shuffled around, every single year, with just one exception (#8 in 2003 was replaced in 2004's table). Of course, such rankings are imperfect, and perhaps do not capture genuine changes within the research ecosystem. But neither does this seem illustrative of dynamism and change64! By contrast, if you consider the top ten technology companies in the NASDAQ index, they've been transformed over the same time period. 2022's biggest companies include Meta (Facebook), which didn't exist at all in 2003; Tesla, which began operation in 2003; and Alphabet (Google), which was a promising but still small private company in 2003. Many of today's other behemoths were relatively small in 2003, like NVIDIA, Amazon, and Apple, all far down the NASDAQ. As we noted at the beginning of this section, it simply wasn't possible in 2003 for a grad student to start their own research university, and to have it grow to be one of the top 10 research universities in the world. And yet the analogous thing did happen with NASDAQ companies. There is far more institutional dynamism in tech than there is in research. This is not an intrinsic fact about the world, but rather a consequence of the ways the institutions are designed, and so can be changed.

As another illustrative example, consider the case of Katalin Kariko, one of the key scientists behind the messenger RNA (mRNA) vaccines that helped end the Covid-19 pandemic that dominated the world in the early 2020s65. Kariko worked in relative obscurity for decades. Never earning more than $60,000 per year, she was ultimately demoted by the University of Pennsylvania. Her grant applications were repeatedly turned down: "Every night I was working: grant, grant, grant… And it came back always no, no, no." A key collaborator said of their attempts to raise funding: "People were not interested in mRNA. The people who reviewed the grants said mRNA will not be a good therapeutic, so don't bother," and summarized the situation: "When your idea is against the conventional wisdom that makes sense to the star chamber, it is very hard to break out". Kariko eventually left the university and academia.

On its own this story doesn't illustrate stasis. Rather, it seems merely to be a story of mistakes made by the NIH and the University of Pennsylvania. But any system will always make mistakes. The issue of stasis arises when we ask whether the discovery ecosystem systematically learns from such mistakes. Have the NIH or the University of Pennsylvania done a serious post mortem on this failure, and made determined changes to the way they do things? The signs are not encouraging. The University of Pennsylvania, which demoted Kariko before she left, now runs advertisements touting the way "Penn researchers made the breakthrough" leading to the mRNA vaccines. And Sudip Parikh, chief executive of the American Association for the Advancement of Science, claims that "The mRNA vaccines are a product of doubling our investment in the NIH"66. He's not entirely wrong: some later work on the mRNA vaccines did benefit from NIH funding. But at the crucial early stages, when support was most needed, it was: "And it came back always no, no, no." If research organizations can and will take credit for discoveries where they manifestly screwed up, how can the discovery ecosystem get any better67?

Kariko's story has been told and retold so many times, in so many high-profile places, that it may become an exception to the rule, and actually lead to genuine change. But the pattern is commonplace. In the late 1980s the molecular biologist Douglas Prasher figured out how to clone the green fluorescent protein (GFP) giving some jellyfish their bright green color. Prasher realized the bright green color made GFP potentially an excellent tracer, a kind of tag that could be used to track the location of cells (in other organisms, not just jellyfish), and thus monitor things like gene expression in an organism68. Unfortunately, Prasher's requests for further funding were turned down; fortunately for humanity, the work was taken up by other scientists, some of whom later received the 2008 Nobel Prize in Chemistry. One of the Laureates stated of Prasher's contribution: "They could've easily given the prize to Douglas and the other two and left me out"69. At the time of the Nobel Prize, Prasher was unable to obtain work in science, and was instead working as a shuttle bus driver for a car dealership. Honest work, making a contribution, but also something that could be done by many people. It's hard not to agree with one of his former colleagues, who said it was a "staggering waste of talent"70.

Again: the issue is not that a mistake was made by the funder. Large-scale systems operating under conditions of great uncertainty will always make mistakes. Such mistakes are an important and indeed inevitable part of the process. The real problem is how to make those systems responsive, so they learn from their errors. And so, as with Kariko, the right question to ask is: has there been a serious post mortem on the error with Prasher? And has there been any systematic change in funding or hiring practice as a result? Again: we know of no major initiative to make such changes. Unfortunately, the discovery ecosystem currently appears to learn little from such errors.

You might respond: wasn't all this work ultimately done anyway? So is the failure to support the work really an indictment of the current discovery system? And does it matter if it doesn't change in response to such apparent errors? The trouble with this argument is that we've no way of knowing what discoveries are not being made at all. We see merely the tip of the iceberg, scorned scientists who managed to just barely find their way through. But how many Katalin Karikos have been missed? How many are struggling out there right now, perhaps on the verge of leaving science? And how many have already left science, or never even made the entry?

Unfortunately, most of the examples of change in social processes that we know of are changes toward more bureaucracy and "accountability". It's perhaps unsurprising that bureaucracies want more (seeming) control, but it often seems likely that such changes have done as much to hurt science as to help. Peter Higgs, the physicist who won the Nobel Prize for proposing the Higgs boson – in some sense, the ultimate cause of mass in the universe – has stated: "Today I wouldn't get an academic job. It's as simple as that", and described himself as "an embarrassment to the department when they did research assessment exercises"71. David Deutsch, co-founder of the field of quantum computing, conceived quantum computers in the early 1980s without any grant funding. Later, in 1985, Deutsch received a small grant for followup work. In 2018 he asked a member of the granting committee if he would have gotten that grant under 2018 criteria and was told "no chance", that he could not have ticked any of those boxes. In his Nobel Lecture, Sydney Brenner, one of the great molecular biologists, stated: "I want particularly to record the patient and generous support given to me by the Medical Research Council of Great Britain. Such long term research could not be done today, when everybody is intent only on assured short term results and nobody is willing to gamble. Innovation comes only from the assault on the unknown"72. And in 2009 in The New York Times, Richard Klausner, a former director of the NIH's National Cancer Institute, said73: "There is no conversation that I have ever had about the grant system that doesn't have an incredible sense of consensus that it is not working. That is a terrible wasted opportunity for the scientists, patients, the nation and the world." In the 13 years since, the NIH – the world's largest science funder – has only made relatively small, incremental changes to how it operates.

In this section, we've argued that many of the social processes of science are in stasis, and identified four bottlenecking forces responsible for that stasis. Our argument is not watertight, and we don't expect to convince people who do not want to be convinced. But we believe the argument is plausible enough to proceed. In the remainder of this essay we ask: how can we avoid or weaken these bottlenecking forces, to achieve scalable improvements in the social processes of science?

A prototype of success: the replication crisis and the Renaissance in social psychology

So, is there any prospect for breaking this near stasis? In an ideal world we'd recount many inspiring stories of change, and develop a playbook for what we shall later dub metascience entrepreneurship74. Unfortunately, the inhibiting forces are strong, and many of the examples we looked at while researching the essay were depressing: "stories of sclerosis" is a better title for a horror movie than an essay about science. Still there are some partial successes, and in this section we discuss a major change we admire, both to see what we can learn, and to try to understand what remains to be done.

The example comes from the replication crisis in social psychology75, which we briefly mentioned earlier. The crisis is usually described as a negative event, but as we'll see it can be viewed as a positive prototype for changed social processes, while also illustrating the challenges of making such changes. It is perhaps most often associated with a remarkable 2015 paper76, which attempted to replicate the results of 100 experimental social psychology papers, all taken from leading psychology journals. This 2015 paper, put together by a collaboration of 270 authors calling itself the Open Science Collaboration, found that only 36 of the 100 replications reported statistically significant results; by contrast, 97 of the original studies reported statistically significant results. Furthermore, the mean effect sizes in the replications were roughly half of the original effect sizes.

This paper helped ignite a major controversy not just within social psychology, but broadly across science. Many people considered the drop from 97% to 36% a sign that something was badly amiss in social psychology. A 2015 New York Times article quoted Jelte Wicherts, an associate professor of scientific methodology and statistics at Tilburg University, as saying77:

I think we knew or suspected that the literature had problems, but to see it so clearly, on such a large scale – it's unprecedented.

In the seven years since publication, the Open Science Collaboration's 2015 paper has been cited more than 7,000 times. We've had individual scientists tell us things like: "I no longer trust many of my own papers before [some date, typically around 2014-2016]". It's not that they were dishonest in the earlier papers: rather, they were proceeding honestly and carefully as they understood things at that time, but they now understand that the methodological approaches they used in earlier work were unreliable.

Some well-known social psychologists were unhappy with the furor. In the same New York Times article, Norbert Schwarz, a professor of psychology at the University of Southern California, commented:

There's no doubt replication is important, but it's often just an attack, a vigilante exercise

In a highly critical response to the 2015 paper, also published in Science78, researchers at Harvard and the University of Virginia concluded:

We applaud efforts to improve psychological science, many of which have been careful, responsible, and effective, and we appreciate the effort that went into producing OSC [the Open Science Collaboration]. But metascience is not exempt from the rules of science. OSC used a benchmark that did not take into account the multiple sources of error in their data, used a relatively low-powered design that demonstrably underestimates the true rate of replication, and permitted considerable infidelities that almost certainly biased their replication studies toward failure. As a result, OSC seriously underestimated the reproducibility of psychological science.

As suggested by these quotes, the 2015 Open Science Collaboration paper triggered a lot of pushback and a robust and still ongoing conversation. What did the unsuccessful replications mean? Were there major problems in some techniques commonly used in social psychology? Were major reforms in the field needed? And what (if anything) did it mean for other fields of science? The paper had many thoughtful caveats, e.g.:

It is also too easy to conclude that a failure to replicate a result means that the original evidence was a false positive. Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect. We conducted replications designed to minimize a priori reasons to expect a different result by using original materials, engaging original authors for review of the designs, and conducting internal reviews. Nonetheless, unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes… After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. Humans desire certainty, and science infrequently provides it… Scientific progress is a cumulative process of uncertainty reduction that can only succeed if science itself remains the greatest skeptic of its explanatory claims.

As earlier quotes suggest, there is disagreement about what's causing the replication crisis. However, many scientists suspect the cause is certain practices widely used in social psychology that may make it easy to publish incorrect results. For example: the practice of only publishing a study when it obtains a statistically significant finding. That sounds reasonable – after all, when things don't pan out, doesn't it make sense to move along quickly to your next project? It's practical also, since many scientific journals are extremely reluctant to publish null findings. But this practice has a dark side. If you do enough studies, pure chance means that occasionally you will obtain what looks like "evidence" for an effect, but was really a statistical fluke. Furthermore, if null results are seldom reported, that means the literature may fill up with papers providing "evidence" which is just such statistical flukes. What seems like a reasonable practice can create a consequential bias in scientific journals.
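To make the "pure chance" point concrete, here is a minimal sketch in Python. It rests on assumptions the discussion above does not fix: a conventional significance threshold of 0.05, and studies that are statistically independent and all test an effect that does not exist. The function name is illustrative only.

```python
def chance_of_fluke(num_studies: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_studies` independent studies
    of a nonexistent effect reaches p < alpha purely by chance.

    Assumes each study has false-positive probability exactly `alpha`.
    """
    # Each study independently avoids a false positive with probability
    # (1 - alpha); at least one fluke is the complement of "none at all".
    return 1.0 - (1.0 - alpha) ** num_studies


for n in (1, 5, 20, 60):
    print(f"{n:>2} studies: {chance_of_fluke(n):.0%} chance of at least one fluke")
```

Under these assumptions, 20 studies of a nonexistent effect give roughly a 64% chance of at least one "significant" result. If only that one study is written up, a reader of the literature sees evidence for the effect and none of the 19 null results.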

We've focused on the role of the 2015 Open Science Collaboration paper in triggering the replication crisis. But the replication crisis arose gradually. It's beyond our scope to do a full history, but useful to sketch some of the background. At first this brief history will seem disconnected from the arc of the essay: not a changed social process in sight! But we'll eventually see that these apparently disconnected historical facts reflect why it's so hard to change social processes, and what ultimately needed to happen. The 2015 paper was the culmination of a series of troubling events over 2011-2015. This series began in 2011 when a well-known social psychologist named Diederik Stapel was found to have committed fraud, fabricating data on a large scale. Since then, more than 50 of Stapel's papers have been retracted79. The year 2011 also saw the publication of a paper by the social psychologist Daryl Bem80, using the methods of social psychology to demonstrate evidence for precognition(!!). This wasn't a crank: it was a well-known and respected social psychologist publishing a peer reviewed paper in the high-profile Journal of Personality and Social Psychology. Furthermore, unlike in Stapel's case, there was no suspicion of fraud. Bem's work used standard practices in the field81. However, the extremely surprising nature of the results posed a sharp question about whether those standard practices might sometimes produce unreliable results.

In 2012, Stéphane Doyen and colleagues82 published an attempt to replicate the famous "priming" effect in psychology – the result in "which participants unwittingly exposed to the stereotype of age walked slower when exiting the laboratory". That replication failed. It was an embarrassing failure for the field, in part because in 2011 Nobel Laureate Daniel Kahneman had published a widely-read popular book83 which described the priming studies extensively. In the book, Kahneman states about priming: "disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true." Yet the failed replications led Kahneman to lose much confidence in the field, and he wrote a widely distributed letter, noting that "questions have been raised about the robustness of priming results" and that "I see a train wreck looming"84. Kahneman's letter received a lot of public scrutiny.

With these events in the background, in 2012 it began to be common to use the phrase "replication crisis" to describe the state of social psychology85. Many social psychologists were concerned about the state of their field, and a steady drip of troubling results continued until the 2015 Open Science Collaboration paper demonstrated that the problems weren't just sporadic, but were plausibly affecting the entire field.

But while the replication crisis came to a head over 2011-2015, many of the problems had been known for decades. Writing in 1985, the well-known psychologist Paul Meehl86 identified 10 "obfuscating influences" in what he called "soft psychology". These were widely-used practices, including the practice described above, of only submitting for publication studies which obtain statistical significance. Meehl did not hold back:

The net epistemic effect of these ten obfuscating influences is that the usual research literature is well nigh uninterpretable…. If the reader is impelled to object at this point "Well, but for heaven's sake, you are practically saying that the whole tradition of testing substantive theories in soft psychology by null hypothesis refutation is a mistake, despite R. A. Fisher and Co. in agronomy," that complaint does not disturb me because that is exactly what I am arguing.

"Well nigh uninterpretable" is about as damning a phrase as we know to describe a large fraction of an entire field. Similar points, though less flamboyantly phrased, have often been made by other scientists. Writing in 1975, the well-known social psychologist Anthony Greenwald identified87 many of the same problems with significance testing. Among others, he identified the following behaviors as problematic:

submitting results for publication more often when the null hypothesis has been rejected than when it has not been rejected… continuing research on a problem when results have been close to rejection of the null hypothesis ("near significant"), while abandoning the problem if rejection of the null hypothesis is not close… failing to report initial data collections (renamed as "pilot data" or "false starts")… using stricter editorial standards for evaluating manuscripts that conclude in favor of, rather than against, the null hypothesis.

All these practices are understandable, very human behaviors. If your study "almost" shows statistical significance, it's tempting to run just a few more analyses in the hope of meeting the threshold for significance. That way you can publish it, rather than abandoning months of work as unpublishable! Unfortunately, such practices may also lead to a scientific literature full of erroneous conclusions. As mentioned in Part 1, in 2005 the meta-researcher John Ioannidis published the provocatively-titled "Why Most Published Research Findings Are False". The essence of Ioannidis's paper is that in many parts of science we lack good theories to tell us what are plausible hypotheses and what are not. Because of that, people are likely to test many false hypotheses: if they test far more false hypotheses than true (say, 20 false hypotheses for each true one), then false positives might outnumber true positives in the published literature! Indeed, for any given experimental data set it's often possible to run many different analyses, and in this way you're very likely to eventually find some apparently-interesting hypothesis that seems(!) supported by the data. The scientific literature would fill up with plausible results that were really just statistical flukes, not indicative of anything about nature.
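The arithmetic behind the "20 false hypotheses for each true one" example can be sketched as follows. The 20-to-1 ratio comes from the text; the significance threshold (0.05) and statistical power (0.8) are assumed values chosen for illustration, and the sketch further assumes that every significant result, and nothing else, gets published. The function name is hypothetical.

```python
def share_of_positives_that_are_true(
    false_per_true: float, alpha: float = 0.05, power: float = 0.8
) -> float:
    """Fraction of statistically significant findings that reflect a real effect.

    false_per_true: number of false hypotheses tested per true hypothesis.
    alpha: chance that a false hypothesis yields a significant result (assumed).
    power: chance that a true hypothesis yields a significant result (assumed).
    """
    true_positives = power * 1.0             # expected hits per true hypothesis
    false_positives = alpha * false_per_true  # expected flukes per true hypothesis
    return true_positives / (true_positives + false_positives)


share = share_of_positives_that_are_true(false_per_true=20)
print(f"Share of significant findings that are true: {share:.0%}")
```

With these assumed numbers, each true hypothesis contributes about 0.8 true positives while its 20 false companions contribute about 1.0 false positives, so a little under half of the significant findings are real. Lower power, or the flexible analyses described above, would push that share down further.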

Prior to the replication crisis coming to a head over 2011-2015, the papers by Meehl, Greenwald, and Ioannidis were highly-cited papers by well-known scientists. And there were other papers in a similar vein, correctly pointing out severe problems with standard practice in social psychology and other fields. Yet these well-founded criticisms caused few discernible changes in the practices actually carried out by scientists. Put starkly: the problems underlying the replication crisis had been widely known for decades before 2011-2015. And yet almost nothing had been done about it.

The great benefit of the crisis of 2011-2015 is that it was so acute that it helped instigate genuine methodological and social change. Many of these changes are being expressed in changed social processes. One is a gradual rise in scientists and journals using an approach to publishing papers called Registered Reports88. In the standard approach to publishing papers, scientists design and perform an experiment, analyze the results, and then submit a paper describing the results to a journal. The paper is peer reviewed and (if it passes peer review) is published. To pass peer review, a paper must be judged methodologically sound, and also "scientifically interesting". High-profile journals, in particular, set a high bar for what counts as "interesting". Unfortunately: "We looked for such-and-such an effect and found nothing" isn't usually considered "interesting". This creates a bad incentive: as a scientist, it's tempting to keep asking different questions of your data until you find a significant effect, one plausibly meeting the bar for "interesting", and so can publish. Or it's tempting to keep taking just a little more data until you achieve significance (and so can publish). And so on. Something which seems innocuous, even sensible – the desire of journals to publish interesting positive results – has many bad consequences.
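The temptation to "keep taking just a little more data until you achieve significance" can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any particular study: the data are pure noise (standard normal, so there is no real effect), the test is a simple two-sided z-test with known variance, and the batch size, maximum sample size, and number of simulated studies are arbitrary choices. All names are hypothetical.

```python
import math
import random


def p_value_two_sided(sample: list[float]) -> float:
    """Two-sided p-value for 'the mean is zero', assuming unit variance (z-test)."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(peek: bool, trials: int = 2000, batch: int = 10,
                        max_n: int = 100, alpha: float = 0.05,
                        seed: int = 0) -> float:
    """Fraction of no-effect studies that end up 'significant'.

    peek=False: test once, after all max_n observations are collected.
    peek=True:  test after every batch and stop as soon as p < alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        data: list[float] = []
        significant = False
        while len(data) < max_n:
            data.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
            if peek and p_value_two_sided(data) < alpha:
                significant = True  # stop early and "publish"
                break
        if not peek:
            significant = p_value_two_sided(data) < alpha
        hits += significant
    return hits / trials


print("test once at the end:  ", false_positive_rate(peek=False))
print("peek after every batch:", false_positive_rate(peek=True))
```

Testing once at the end should give a false positive rate near the nominal 5%, while stopping at the first significant peek gives a rate well above it, even though no single step looks like misconduct. Fixing the sample size and the analysis in advance removes exactly this freedom.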

Registered Reports radically change this model. The idea is for scientists to design their study in advance: exactly what data is to be taken, exactly what analyses are to be run, what questions asked. That study design is then pre-registered publicly, and before data is taken the design is refereed at the journal. The referees can't know whether the results are "interesting", since no data has yet been taken. There are (as yet) no results! Rather, they're looking to see if the design is sound, and if the questions being asked are interesting – which is quite different to whether the answers are interesting! If the paper passes this round of peer review, only then are the experiments done, and the paper completed. A routine second round of refereeing is then done, to ensure methodological soundness, and the paper is published. As of this writing (2022), more than 300 journals are publishing Registered Reports.

There are encouraging signs that pre-registered study designs like this are helping address the methodological problems described above. Consider the following five graphs. The graphs show the results from five major studies89, each of which attempted to replicate many experiments from the social sciences literature. Filled in circles indicate the replication found a statistically significant result, in the same direction as the original study. Open circles indicate this criterion wasn't met. Circles above the line indicate the replication effect was larger than the original effect size, while circles below the line indicate the effect size was smaller. A high degree of replicability would mean many experiments with filled circles, clustered fairly close to the line. Here's what these five replication studies actually found:

As you can see, the first four replication studies show many replications with questionable results – large changes in effect size, or a failure to meet statistical significance. This suggests a need for further investigation, and possibly that the initial result was faulty. The fifth study is different, with statistical significance replicating in all cases, and much smaller changes in effect sizes. This is a 2020 study by John Protzko et al90 that aims to be a "best practices" study. By this, they mean the original studies were done using pre-registered study designs, as well as large samples and open sharing of code, data and other methodological materials, making experiments and analysis easier to replicate. The labs that did the original studies each did their own self-confirmatory replication; they then asked three other labs to perform replications – the fifth graph above shows combined results from those independent replications. In short, the replications in the fifth graph are based on studies using much higher evidentiary standards than had previously been the norm in psychology. Of course, the results don't show that the effects are real. But they're extremely encouraging, and suggest that the spread of ideas like Registered Reports contributes to substantial progress.

While the replication crisis is usually described negatively, it seems to us that in retrospect the 2010s will be seen as the beginning of a Renaissance in social psychology. 2011-2015 wasn't a negative time, but rather was the period when people began to shine a spotlight on problems that had existed for decades, and to take action to address those problems91. The eventual result will likely be a much-improved field. This is expressed through improved norms, improved tools, improved training, but all instantiated and funneled through new social processes, perhaps including most notably the widespread use of pre-registration, as well as ideas like open sharing of data and other materials.

That said, it's still a work in progress. Code and data sharing to help enable replication is increasing rapidly, but is far from universal. And it's not yet clear just how frequently used pre-registration should be. The earlier, non-preregistered style is useful for rapid exploratory work; eliminating that style entirely would set psychology back. It's important to be able to explore quickly and unrigorously! Such exploratory studies should continue to be done, but we need to be far more cautious in how we think about the results. Alongside those exploratory studies will be more rigorous studies, with much higher standards of evidence, using ideas like pre-registration. The end result may be two or more common styles of publication, with different epistemological statuses. Speculation aside, the underlying changes are still nascent: changes in people's values, their informal folk models of how to do science, and the way those are reified in social processes. But while nascent, progress is encouraging. It's an example where the social processes of science are changing in a significant way, broadly across a field. Social psychology is undergoing a major cultural change.

When we discuss the replication crisis with metascience colleagues we're occasionally told the work on replication is "good, but boring". It's not the incredible sunny optimism of flashy new research institutes, or fancy ideas for new funding approaches. No-one is giving triumphant interviews about how they're coming to save science. It's a set of simple but powerful ideas for social and methodological changes which will make social psychology more reliable. But while it isn't flashy, it is scaling out. Not just one or a few local changes, but significant and broad structural changes to the social processes by which psychology operates. And it did so after decades of inaction. It is a valuable example of how the social processes of science can change, from which much can be learned.

Learning from the Renaissance in social psychology

A crisis is a terrible thing to waste. – Paul Romer

What can we learn from the Renaissance in social psychology? For us, one important lesson is that a deep crisis can help enable a broad change in the social processes of science. As we've seen, for decades well-known scientists had pointed out problems in the practices of social psychology, and suggested ideas for solutions. Yet that understanding did not produce a major crisis, nor did it produce substantive change. Even minor crises – Bem's 2011 paper on precognition, or Doyen's 2012 failure to replicate the priming studies – were not enough. The crisis didn't become acute until the 2015 Open Science Collaboration paper. Only at that point did scientists and institutions start to become broadly willing to consider alternative ways of working92.

Of course, a crisis alone was also not enough. It required deep new ideas, such as Registered Reports93, as well as the toolbuilding and infrastructure necessary to make them work. It required partnerships with journals and other organizations, so that Registered Reports had a chance to be widely adopted. It required branding and marketing and narrative-crafting, to get widespread adoption and to begin a change in the internalized values of scientists. It required the building of yet more tools and infrastructure to store code and data and materials, to enable easier replication. And, as we'll see, it required institution-building. If those things hadn't happened on top of the acute crisis, the field would still be stuck. In this sense, the crisis instigated the conditions under which change was possible. But it required many other elements for change to actually take place.

Many of these required elements are outside the usual job description for scientists. Consider that Greenwald, Meehl, and Ioannidis were all doing what academic research scientists conventionally do: developing ideas and understanding, publishing papers. But while ideas and understanding are important, on their own they aren't enough to produce change. And scientists are usually "not supposed" to build tools and infrastructure, unless to take data immediately needed in their scientific work94, nor are they "supposed" to get involved in branding and marketing, activities that many scientists turn their noses up at. Yet those activities are essential for cultural change. Thus, the actions necessary to achieve change are not part of what is traditionally regarded as a scientist's job. This is a major inhibitor in achieving such change.

Many people have played important roles in instigating the replication crisis. But perhaps no single person has done more than Brian Nosek. Nosek is a social psychologist who until 2013 was a professor at the University of Virginia. In 2013, Nosek took leave from his tenured position to co-found the Center for Open Science (COS) as an independent not-for-profit (jointly with Jeff Spies, then a graduate student in his lab). Nosek and the COS were key co-organizers of the 2015 replication paper by the Open Science Collaboration. Nosek and the COS have also been (along with Daniël Lakens, Chris Chambers, and many others) central in developing Registered Reports. In particular, they founded and operate the OSF website, which is the key infrastructure supporting Registered Reports. That's not all OSF does: it's also a general platform for scientists to share papers, code, and data, and is designed to make it easy for other labs to replicate work. Finally, Nosek's been a frequent public advocate for replication, doing on-the-ground work to change how scientists think about the subject, which has required both strong scientific argument and also good marketing and brand-building. In short, Nosek and the COS are key figures in driving a massive, systemic change in social psychology. They're helping change the culture of science.

The origin story of the COS is interesting. In 2007 and 2008, Nosek submitted several grant proposals to the NSF and NIH, suggesting many of the ideas that would eventually mature into the COS95. All these proposals were turned down. Between 2008 and 2012 he gave up applying for grants for metascience. Instead, he mostly self-funded his lab, using speaker's fees from talks about his prior professional work. A graduate student of Nosek's named Jeff Spies did some preliminary work developing the site that would become OSF. In 2012 this got some media attention, and as a result was noticed by several private foundations, including a foundation begun by a billionaire hedge fund operator, John Arnold, and his wife, Laura Arnold. The Arnold Foundation reached out and rapidly agreed to provide some funding, ultimately in the form of a $5.25 million grant.

Buoyed by the funding, in 2013 Nosek left the University of Virginia to start the Center for Open Science. This may seem strange: why not keep it at the university? But as we've seen the work of the COS was not social psychology in the conventional sense. Rather, Nosek was something else, a metascience entrepreneur, working to achieve a scalable change in the social processes of science. Setting the COS up independently gave them freedom to operate in ways difficult in a conventional academic environment. For instance, in many universities it would be difficult and slow to hire the designers and programmers needed to develop infrastructure such as OSF and Registered Reports. Nosek estimated to us that roughly 1-in-5 of the COS staff could be considered researchers in anything like the conventional sense. The repeated objection when attempting to make such hires in an academic environment is "that's-not-really-science". It's ironic, in retrospect: Nosek and the COS are having tremendous impact on psychology, as a consequence of placing metascience at the core of their practice. It's a more expansive view of what a scientist can be.

This story concretely reflects many of the inhibiting factors we discussed earlier. Consider the "that's-not-really-science" problem: change in the social processes of science is no-one's job, certainly not the job of working scientists. Or the challenge of raising funds via conventional channels: it seems to us not an accident that the COS ultimately raised money from an unconventional source. And then the structural obstacle of hostility from important peers. Nosek reports a journalist telling him that a "big shot" peer had said "Nosek is John Arnold's willing idiot". Tage Rai, who became an editor at Science magazine after the 2015 article, repeatedly attacked work on replication, claiming for instance that there "are powerful private and gov't interests hoping to co-opt the replication crisis to gain leverage over deciding what kind of research you can produce", and directly attacking Nosek. So the replication crisis is a story of perseverance, not just by Nosek, but by everyone involved. We've argued here that cultural change is often enabled by sustained critique of powerful institutions in science. This requires bravery, and may have substantial professional consequences, under science's self-governing model, whereby an individual's future is determined by the judgments of their colleagues. Creating a crisis for scientific institutions is a deeply unpopular career move.

Nosek and the Center for Open Science have played a key role in the Renaissance in social psychology. However, it's unfair and historically incomplete to single out an individual as much as we have: the work we've described is part of a movement involving hundreds or thousands of other people. Unfortunately, a detailed history of the replication crisis is beyond our scope. To those many people whose work we have unfairly omitted, our apologies. The reason we've given this incomplete history is that it highlights an important pattern for change in the social processes of science: that of the metascience entrepreneur.

Metascience entrepreneurship

You never change things by fighting the existing reality. To change something, build a new model that makes the existing model obsolete. – Buckminster Fuller

A metascience entrepreneur is a person, especially an outsider, who aims to make a scalable improvement in the social processes of science. They take responsibility for making change happen. To list a few metascience entrepreneurs now or recently active: there's Paul Ginsparg, who founded the arXiv preprint server. There's Heather Joseph and Peter Suber, each instrumental in different ways in helping increase open access to scientific papers. There's Adam Marblestone, Anastasia Gamick, Sam Rodriques, and Tom Kalil, who are developing the Focused Research Organizations we mentioned earlier. There's Don Braben, who started the Venture Research Unit at BP and later University College London, making the case for giving scientists far more freedom to pursue ambitious lines of research. There's Brian Nosek and many others at work on the Renaissance in social psychology. And, of course, there are many other metascience entrepreneurs96. Not all changes in the social processes of science come about through metascience entrepreneurship, but it is an extremely important pattern97.

The term metascience entrepreneur is new, and we considered many terms before deciding to use this one. Perhaps the best alternative we considered is "applied metascientist": it's shorter, rolls off the tongue more easily, and some scientists are suspicious of the commercial overtones of "entrepreneur"98. However, a problem with "applied metascientist" is that it implicitly suggests a body of theoretical metascience, which is then used by the applied metascientist. In practice, we expect the connection to be strongly bidirectional, and for this reason have preferred "metascience entrepreneur". We are concerned that the term may mislead people into thinking it's about making a profit; rather, it's about a more expansive notion of entrepreneurship, as a person working to build the future. In any case, a good term for this concept is certainly needed, and we're open to better language.

We've argued that creating a crisis has sometimes been crucial in instigating widespread changes in social processes. Is it possible to avoid creating a crisis, or is it necessary? We believe it is possible to avoid this when the social process is in whitespace, but it's much harder when there's an incumbent system and network effects inhibiting change. For instance, prior to the replication crisis it was extremely difficult for social psychologists to unilaterally adopt better practices. This was in part for practical reasons (the tools and infrastructure were lacking), and in part for social reasons (doing so would place them and their collaborators at a short-term competitive disadvantage relative to their peers). Similarly, while (say) someone like Don Braben can run a trial program for his approach to scientific freedom, that doesn't mean his approach will gain traction in displacing more traditional approaches. That would likely require a crisis of confidence in those traditional approaches, to help instigate a change.

In our explanation of metascience entrepreneurship, we've emphasized the role of outsiders. Of course, people like Nosek, Ginsparg, et al aren't outsiders in certain senses: after all, they've spent their careers as working scientists. But we mean outsiders in the sense that they weren't decision makers at powerful institutions, able to move vast sums of capital, or change policy. Nor were they enormously wealthy, and able to simply build what they desired. The reason it's important to have outsiders as metascience entrepreneurs is that it will decentralize change in the social processes of science. That will expand the range of people who can cause change; and increase the range of ideas that are tried. Indeed, it may be essential for outsiders to participate, if severe and sustained critique of existing institutions is required. A crisis can be extremely difficult to instigate from inside a system. Nosek, recall, was unable to obtain support from conventional funders; when he did obtain funding, he took leave from his academic job to set up the Center for Open Science. Of course, over time he's partnered with existing institutions to cause change. But the change was instigated from outside existing power centers. That's the pattern of a successful revolution.

For these reasons, we don't expect all or even most of the best ideas for change to come from the directors of a few major grant agencies; nor from wealthy individuals hoping to have an impact. We expect the best ideas will come from unexpected people at the edge. As we noted earlier, it was a huge achievement of the 17th century scientific revolution to decentralize change in ideas: Bacon helped break the monopoly of the Church and Aristotle on authority; the Royal Society baked decentralization into their motto, nullius in verba, take no-one's word for it. We need the same kind of decentralized ability to change social processes in science. And there is currently no institutional means to make such changes.

This outsider-oriented approach may be contrasted with an insider-oriented approach. Certainly, many people want to help existing funders and research organizations trial new ideas, and to scale out the best, perhaps doing something like J-PAL-for-science. The people leading and advising such organizations can do much good, but these are likely to be insider roles, incremental and evolutionary. Many people think that if they just go work at their favorite funder for a few years, they can make the big changes necessary. But they get bogged down in the inertia of existing systems. Most organizations are not designed to rapidly change, and they're certainly not designed to admit error in their past ways of doing things. We've had conversations where people at the top of some of the world's largest funding agencies express the belief that their hands are tied: the organization requires a much larger change than it is capable of making. It's very different to the outside view of the metascience entrepreneur, who may be working in social process whitespace, or with truly disruptive criticism of existing institutions. Similarly, some of those people at the funders have expressed to us that they are doing metascience entrepreneurship. And yet they're not: their obstacles are quite different, and require separate analysis.

This is why we've focused on metascience entrepreneurship, and (in what follows) on structural changes intended to help enable outsiders to win. Outsiders may well start with far fewer resources than incumbents; but if their ideas and execution are better, in an ideal system they would grow to outcompete those incumbents.

The work of such metascience entrepreneurs potentially looks quite different from what is usually evaluated as promising. Surprisingly many proposals come from true outsiders (sometimes denizens of Silicon Valley), including sometimes people who have had no more than modest success in a scientific career. This used to annoy one of us (MN), when he saw overconfident pronouncements about "what's wrong with science" from such outsiders99. The retort which came to mind: "If you know so much about science, why haven't you made any important discoveries?" But he's (somewhat!) changed his mind. Overconfidence in determined action is a feature of Silicon Valley that deserves to be emulated elsewhere, enabling ideas to be explored that otherwise would not be100. Of course, it usually fails. But genuine scientific successes may pay for all the rest. We would not have invested in DeepMind, a company started by green graduate students with venture capital backing. And we would have been wrong, not just financially, but more importantly scientifically. Sometimes outsiders find better ways of doing things.

Let's return to the metascience learning loop:

There are many things existing organizations can do to enable this process. All of the metascience entrepreneurs discussed above have already had (or likely will have) world-changing impact, far beyond a typical scientific grant; despite this, all have had unusual amounts of trouble raising funding. Funders aren't usually looking to fund this activity, often regarding it as "not really science". Cultural change isn't currently anybody's job. As a funder, two simple enabling activities would be:

As we've noted, metascience entrepreneurship often requires a crisis for change to be possible. However, crises also make incumbents unhappy. In many instances, if people aren't shouting at a metascience entrepreneur, it means they're not having much impact. Brian Nosek being called "John Arnold's willing idiot" is in some sense a sign of success. So, more subtly, is the dismissal of such people as "not real scientists", merely cleaning up the work of their betters. Put another way: community anger or dismissiveness may in fact be a useful proxy for success. For these reasons, programs like those above may be a good target for outsiders, funders with strong stomachs and a contrarian desire to leapfrog the Scientific Establishment. Over the long run, a good goal would be to change the conception of a scientist's role so it includes metascience entrepreneurship.

In a more conventional mode, it should be possible for all funders (and, more broadly, scientific institutions) to set up an:

Ten percent is a lot by current standards. We think of it as the funder's self-improvement fund. In the early 20th century, top-level athletes often "trained" mainly by playing their sport. They didn't really understand how to train. But training is where you try new things, figure out how to make them work, and change your game, perhaps radically. Putting in ten percent seems like a lower bound on what it would mean to take training seriously.

What we'd like to see, ultimately, is the emergence of a thriving community of metascientists101. This community would blend both theoretical metascience, studies of what works well, as well as metascience entrepreneurship, attempting to imagine and build the social processes of the future. In this future, metascientist would become a shared aspirational identity, much as "scientist" or "entrepreneur" is now. A community of practice could develop, sharing ideas, mentorship, resources, and so on. It could identify patterns to improve science, and patterns which result in failure102. The lack of awareness of these patterns means that today most attempts at metascience entrepreneurship simply fail or never get off the ground. Over the remainder of the essay we'll make the case that if this community thrived, it could place metascience at the core of science, a kind of engine or dynamo driving decentralized improvement in the social processes of science.

Patterns of metascience entrepreneurship

We've discussed the replication crisis in detail because it suffers extreme versions of many of the inhibiting effects identified earlier. We don't have space here to discuss many other examples in similar detail. But it's helpful to identify some of the different patterns that metascience entrepreneurs have used.

The preprint arXiv, for example, relies for its success primarily on building a great product that is attractive to scientists. One major inhibiting factor was that "it just wasn't science". The founder of the arXiv, Paul Ginsparg, was a well-known physicist working in string theory at Los Alamos National Laboratory (LANL). But as he developed the arXiv it consumed more and more of his time, and he gradually became what we would call a full-time metascience entrepreneur. He ultimately resigned his position at the lab after receiving an unfavorable performance evaluation, describing him as having "no particular computer skills contributing to lab programs; easily replaced, and moreover overpaid, according to an external market survey". He went to Cornell University, and more and more of his professional identity became invested in the arXiv. One of his new colleagues at Cornell said of his Los Alamos performance evaluation, "Evidently their form didn't have a box for: ‘completely transformed the nature and reach of scientific information in physics and other fields’"103.

The arXiv didn't suffer as strongly from inhibiting network effects as did the changes we've discussed in social psychology. For instance, for the most part a scientist submitting to the arXiv doesn't suffer any ill effects. When they first use the arXiv they may fear getting scooped or violating journal publication policies, but these relatively minor fears are usually quickly overcome104. The arXiv adopted a clever growth strategy, starting in a relatively narrow area – high-energy theoretical physics – where they could bootstrap off Ginsparg's professional connections. Scientists in closely adjacent areas would then start checking in, making those adjacent areas good avenues for addition to the arXiv105. It remains unfortunate that the arXiv's revenues are far lower than The Physical Review, perhaps the premier publisher in physics; the arXiv has become far more important for the progress of physics (never mind other sciences) than The Physical Review.

A very different approach to metascience entrepreneurship has been taken by open access advocates such as Heather Joseph and Peter Suber, who have helped intellectually architect and lobby for open access policies. We said earlier that it's rarely practical to pursue change by working through powerful grant agencies. But that's the strategy that's been (successfully!) pursued by open access lobbyists. The essence is to convince powerful granting bodies and their political masters to adopt open access mandates. For example, the US Congress made it so that from 2008 on NIH-funded research had to be made open access within 12 months of publication. This policy has since been strengthened, including most recently in 2022, when the White House directed all US government agencies funding research to develop policies which would make that research immediately openly accessible. This progress has been achieved through extensive lobbying over the past 20 years. In this sense it's quite different from arXiv: it's not a product used directly by scientists, but rather a policy change aimed at achieving collective action. To get this action taken, it's been necessary to enlist help from leading scientists to influence powerful entities, but not to convince every individual in a scientific community. We note, however, that some of the same inhibiting patterns are still in play: echoing earlier examples, Suber resigned his position as a tenured professor, and was funded by a series of temporary grants.

Another metascience entrepreneur, Don Braben, started the Venture Research Unit with funding from BP in the 1980s. This was an attempt to identify and fund unusually adventurous scientific work. Braben was interested in funding scientists who were potential members of what he called "the Planck Club", people doing work of a caliber comparable to Max Planck, the originator of quantum mechanics. Braben convinced BP to spend upward of 20 million GBP on the Venture Research Unit, funding roughly 30 proposals from scientists around the world. To convey the flavor of what he was looking for, Braben has proposed the BRAVERI score (Braben Venture Research Index), which gives high scores to proposals for research which: is difficult to define; addresses no extant peer group; would have trouble being published in a mainstream journal; has little or no competition; and has no clear definition of success at the outset. To give more concrete flavor, for the timescale component of the score, Braben proposes rewarding projects where the timescale is "indeterminate" and "successes, whatever they are could be achieved at any time, or never". By contrast, a project would be penalized if it was "expected to meet its targets in the time allowed"106. The Venture Research Unit was a remarkable trial, and he's written several striking books about this and related topics107. It's influenced other programs, but it's not an example of decisive scaling of change in the culture of science. For instance: no large-scale funding program that we know of is using criteria clearly descended from BRAVERI. It's an example of the difficulty in scaling. We expect most metascience entrepreneurs today will meet the same fate, regardless of whether their idea is a decisive improvement over the status quo or not.

There are many striking contrasts between these instances of metascience entrepreneurship. As we've seen, Braben's approach has not yet scaled out. While he provides a nice list of projects funded, it's not decisive evidence that his approach is better than the conventional approach. Someone initially hostile to Braben would be unlikely to be won over by the evidence he gives. By contrast, the Center for Open Science is succeeding because they have obtained decisive evidence that: (a) things weren't working in social psychology; and (b) they have developed a far better set of social and methodological processes. The evidence is so strong it has won over even people initially hostile to the approach108. The arXiv and open access mandates are different yet again. The open access mandates were achieved by persuading powerful people and institutions. The arXiv achieved change through individual scientists making the decision to routinely use the arXiv. We are personally in favor of both, and believe the core arguments made for them are correct – indeed, one of us (MN) worked for some years as an open science advocate. But despite that, someone hostile to those arguments could in good faith remain so109: we don't have overwhelming proof that either the arXiv or the open access mandates are decisively better for science. With that said: we're sympathetic to the decentralized product adoption pattern used by the arXiv. Absent strong evidence to the contrary, we trust individual scientists to choose what's best for them. We're more cautious about the centralized change pattern exemplified by the open access mandates. A danger with any centralized pattern is that persuasion and politics may result in damaging processes being adopted, unless a high evidentiary bar is applied at the point of control. Note that we're not talking here about the specific open access policies, which we strongly believe are helpful, but rather making an abstract point about the nature of the change mechanism. The key overall point is that we want patterns of successful change to result in improvements to science.

We've identified several distinct patterns of metascience entrepreneurship. Through the remainder of the essay we'll focus on one in particular, used in the replication crisis. That pattern is perhaps the most challenging, requiring both decisive results in theoretical metascience and effective metascience entrepreneurship to drive change. But first let's make some summary remarks about the other patterns. In general, the decentralized product adoption pattern has a limited range of applicability, since many social processes aren't expressed through products. But when it does apply it can be very powerful, and not strongly subject to the earlier inhibiting effects. For that reason we won't discuss it much through the remainder of the essay. The centralized change pattern can work, but has the problem that arbitrary changes can be made on the basis of good optics, political palatability, and so on, without any guarantee of improvement. Of course, no-one will think that's their program! The best way of making this a good pattern is to improve evidentiary standards from theoretical metascience. Our subsequent discussion of theoretical metascience thus applies to this pattern. As we noted earlier, social processes that are local and not collectively held can be changed relatively easily, and for that reason we don't discuss them further. Through the remainder of the essay we focus on the theoretical metascience pattern. It applies to many social processes, and it's able to overcome all the inhibiting factors mentioned earlier, including entrenched norms, the shadow of the future, network effects, and so on.

Incorporating the earlier distinction between changes in local versus collectively held social processes, we can sum up the situation visually as follows:

There's also a sliding scale in difficulty between doing work on social processes where there's no existing established process or few inhibitors, and situations where there are major inhibitors and established processes:

Can we use decisive metascientific results to drive improvement in the social processes of science?

Let's return again to the metascience learning loop,

The replication crisis in social psychology is an example where scaling has been achieved, with metascientific results driving major changes to an entire field. Is it possible to develop a broader array of metascientific results strong enough to drive similar improvements in other social processes? Variations on this vision have been proposed by many people. For example, in 2012 Pierre Azoulay110 proposed that:

It is time to turn the scientific method on ourselves. In our attempts to reform the institutions of science, we should adhere to the same empirical standards that we insist on when evaluating research results. We already know how: by subjecting proposed reforms to a prospective, randomized controlled experiment. Retrospective analyses using selected samples are often little more than veiled attempts to justify past choices.

Many initiatives in this vein are underway. They include the Research on Research Institute111, led by James Wilsdon; the work by Paul Niehaus and Heidi Williams to develop J-PAL-for-science112; and the proposal, led by Brian Nosek, for an NSF Science and Technology Center for Improving Science113. There are also related ideas, like the proposal from James Evans and Jacob Foster to use research on metaknowledge to "reshape science"114.

This vision arises out of a much older tradition of measuring science. One early example comes from James Cattell, who in 1910 wrote an early book about metascience115, posing the challenge: "It is surely time for scientific men [sic] to apply scientific method to determine the circumstances that promote or hinder the advancement of science." Today, merely naming the disciplines that have contributed to this tradition takes some time – economics, scientometrics, science policy, and many more. Let us look briefly at a few representative results from several of these literatures. We do so with no pretense of completeness. Rather, our intent is to provide some context to better understand the opportunity and challenge for metascience.

One rich vein of work is the idea of comparing different countries against one another. This has been done, for example, in papers by Robert May and David King116, who use measures such as citations, publications, and prizes to do country-level comparisons. It's also been pursued, in very different ways, by economists such as Robert Solow, Zvi Griliches, Paul Romer, and Robert Gordon, who have tried to develop an understanding of the relationship between scientific research and concepts from economics such as GDP growth. Both these veins of work can be viewed as a kind of macroscience, attempting to develop aggregate measures to understand (and in many cases manage) the progress of science at a high level. There's also a more intermediate-level mesoscience, things like the Shanghai rankings of research universities, or the UK Research Excellence Framework (REF), which ranks individual university departments. Both these veins of work evaluate science at high levels of abstraction – high enough that they are not intrinsically connected to social processes in a straightforward way, nor to systematically improving our understanding of what works, when, and why. With that said, we have heard many stories of the way evaluations like the Shanghai rankings and the REF affect behavior and processes. It will be interesting to monitor the long-run effect of such evaluations on the rate at which the discovery ecosystem learns.

Another rich vein of work is focused less on aggregates, and more on understanding the details of individual scientists, discoveries, and programs. Again, we'll merely mention a few representative lines of work, out of thousands that could be given117. There is work, for example, studying what team structures make for important work and why, like the paper "Large teams develop and small teams disrupt science and technology", by Lingfei Wu, Dashun Wang, and James Evans118. There is work studying the increasing age at which scientists make major discoveries, and why that increase may be occurring, like that of Benjamin Jones119. There is work doing retrospective analyses of funding programs, like Caroline Wagner and Jeffrey Alexander's analysis120 of the NSF's Small Grants for Exploratory Research program.

This vein of work gives rise to many striking descriptive facts about science. The work is often suggestive of changes. For instance, the Wu et al paper just mentioned suggests that: "both small and large teams are essential to a flourishing ecology of science and technology, and […] to achieve this, science policies should aim to support a diversity of team sizes". As another example, a tremendous amount of work has been done on the relationship between aging and breakthrough discoveries, which has led to much concern at some funders about the aging of their grantees, and attempts to change the situation.

It is interesting to contrast these styles of work with the replication crisis in social psychology. There, the metascientific results were so strong they forced change on a resistant field. Recall the early inability to raise funding from conventional funders; and attacks from central figures of the field, characterizing Brian Nosek as "John Arnold's willing idiot". And yet despite that resistance, they appear likely to revolutionize the methodology and social processes of an entire field. That has required both entrepreneurial building, as discussed in the last two sections, and also extremely strong theoretical results. Recall the theoretical takeaways from our earlier discussion: (1) there is a significant problem where conventional (pre-2015) practices give rise to many apparently non-replicable "discoveries"; but (2) improved practices do replicate. These conclusions arise from a large body of work – we mentioned a few key papers, but really it's the body of work that matters. Taken together, they are decisive metascientific results, strong enough to drive real change. By contrast, most results – like those mentioned in the past few paragraphs – are more descriptive results, often documenting interesting facts, but rarely dispositive of change, especially when it would upset an existing order.

The vision we are interested in, then, is of a metascience which routinely obtains results strong enough to drive real change, even when the evidence comes from outsiders. It is useful to compare the situation to fields where it is routine to obtain such decisive results. Think of the theory of relativity, which overturns some of humanity's most dearly-held ideas about space and time, energy and mass. Many students of physics are initially hostile to relativity, disliking the violence it does to their pre-conceived notions. But the evidence for relativity is so overwhelming that they are forced to accept a new order. That new order accepted, it radically changes the way they interact with the world. In a similar way, the replication crisis has provided a set of theoretical results so strong that it caused a crisis of confidence in old processes, and caused a transition to new processes. But there are as yet very few other results as strong in metascience. Most results are merely suggestive, not dispositive of change. Certainly, they are not strong enough to overcome opposition from powerful entrenched interests. Is it possible to routinely develop metascientific results strong enough to drive real, counterintuitive change in our social processes, even when those results disrupt the existing order?

From now on, to express these ideas we'll use the term decisive result to mean a result sufficiently strong that it would routinely convince someone who was initially hostile or had a vested interest in a different conclusion. This is not a rigorous or precise definition, but it does encode a useful criterion, capturing the notion of a result strong enough to force change, driving the scaling step in the metascience learning loop. To repeat the above in this new language: the key results of the replication crisis were, altogether, a decisive result in this sense.

It's notable that the kind of work required to obtain decisive results isn't just business-as-usual. We'll have more to say about this in the next section, but: consider the extraordinary scale of the Open Science Collaboration: 270 people working together over several years to replicate 100 experiments. Furthermore, this was merely a small part of a much larger effort. There have been many large-scale replication studies; studies of individual questionable research practices; and studies of proposed solutions, like pre-registration. It's a major co-ordinated body of work that is only decisive when considered together as a body. The Open Science Collaboration was far beyond the scale of just another nice paper; but it was also only a small part of what was required. Such a large and nuanced body of work seems required when pursuing a compelling understanding of complex social systems. And, of course, such decisive results still require entrepreneurial activity to result in change.

The situation in metascience today is similar to one that arises often in many parts of the social sciences. One dream of many social sciences is that they will help guide human behavior, and help improve the design of human institutions. Perhaps, for example, psychology can help people lead happier lives, be better parents, and so on? Perhaps economics can help us make better choices about the way central banks set interest rates, about housing policy, immigration policy, the minimum wage, and so on?

This is a marvelous dream which has led to significant successes. But there are also challenges. One is the intrinsic difficulty of solving the problems: a question like "what should the minimum wage be?" is immensely complex, involving a tremendous number of different variables. Second, it's difficult to balance the interests, values, and power of different groups. Third, even if these difficulties can be overcome, and you develop an extraordinarily compelling answer, there is still the problem of whether there is institutional capacity to make the changes. Using social science to motivate change – especially when it runs against some people's perceived self-interest – is extremely difficult121.

Another reflection of these issues has been discussed by the development economist Lant Pritchett, who described program evaluations in development economics as "weapons against the weak"122. That is, politically weak or unpopular programs tend to be much more stringently evaluated. In this view, those evaluations are often ways of justifying decisions that powerful people want made. When our theoretical tools are weak, politics will dominate over evidence. Policy entrepreneurs, certain of the righteousness of their cause, will wield papers as weapons to serve their ends, even when they are wrong about how the world works. It will be extremely difficult for change to come from outsiders, even when they have important and correct new ideas about the world.

Theoretical metascience beyond RCTs

A fundamental question of metascience then is: can we develop results that can be used as weapons against the strong, enabling decentralized improvement in the social processes of science123? As our discussion of the social sciences suggests, this is a considerable challenge! In the passage quoted above, Pierre Azoulay makes a general, broad methodological suggestion for how to do it: "by subjecting proposed reforms to a prospective, randomized controlled experiment." In this section we'll consider and critique this idea, and make some suggestions for building upon it.

A striking prototype in this vein is the 2011 paper "Incentives and Creativity: Evidence from the Academic Life Sciences", by Azoulay, Graff Zivin, and Manso (AGZM for short). Informally, AGZM is often described as testing "people-based" versus "project-based" funding. In particular, it compares the outcomes of the HHMI Investigator program, which gives scientists freedom to work on what they want, with NIH funding, where scientists must seek approval for each new project. AGZM does an unusually careful comparison of the two approaches. It doesn't just look at outputs from the two programs. The trouble is that the people entering the two programs might have different backgrounds, interests, and capabilities. Instead, AGZM attempt to make a fair like-to-like comparison, so the HHMI program can plausibly be considered an intervention, and NIH a control. It's not a randomized controlled trial (RCT) – there's (currently) no way to randomize scientists into HHMI or NIH. But it's plausibly RCT-like. AGZM obtain many striking results, including finding a 39% increase in publication output for HHMI-funded scientists; furthermore, that increase becomes 96% when focused on papers in the top 1% of the citation distribution. These all seem suggestive that "people, not projects" is correct.

AGZM has much to recommend it, notably the idea of doing an intervention and making a fair comparison. It's also got many limitations, as we shall see below. This means that it is not a decisive result in the sense described above. It's true that if you want to believe in "people, not projects", then AGZM provides a fig leaf. But if you don't, then we doubt it would convince anyone otherwise. And that's what decisive evidence can do: convince you of things you regard as a priori unlikely or distasteful. We ourselves personally prefer "people, not projects", and for that reason are sympathetic to AGZM's conclusions. But that's a good reason to be distrustful: if the same type of evidence had shown "projects, not people", we would find reasons to distrust124. That is the litmus test for a decisive result in metascience: is the evidence strong enough to drive real change to displace an existing social process, despite the opposition of people who favor that process?

In what follows we briefly critique AGZM and similar RCT-like proposals, and make a few observations about how to obtain decisive metascientific results. We won't solve the general problem of how to obtain such results: that will take thousands of people many decades. One small indicator of the difficulty of the problem is that the decisive results of the replication crisis weren't obtained by an RCT; rather, they were obtained in a bespoke fashion, in response to a particular problem. Still, despite the challenges, we are optimistic that further work will develop approaches strong enough to routinely provide decisive metascientific results, completing the metascience learning loop.

Isn't discovery intrinsically extremely slow to evaluate? A common retort to any effort to improve the social processes of science is: "It's impossible to do this well because it takes so long to understand the importance of scientific discoveries. Doesn't that make it impossible to set up learning feedback loops like the ones you're talking about?" There's some truth to this: it often does take a very long time to understand the importance of a discovery. But it does not therefore follow that we can't make progress. That would be like saying "the human eye can only resolve down to tens of microns, therefore it's impossible to see E. coli bacteria". Much of the challenge in science – and we expect in metascience – is to develop better theories and better instruments to amplify signal amidst the noise. It's a challenge to develop better metascientific instruments, perhaps a metascience microscope or chronoscope125 to help us better and more rapidly understand the importance of scientific work, amplifying currently illegible signals into something meaningful. Good scientists do this kind of evaluation intuitively: the challenge is to figure out how to do it as well or even better systematically. With that said, this isn't a concrete proposal; we're merely pointing out that the "reasoning" many people use here is a non sequitur: "it's not so obvious how to do this, therefore it's impossible". And that's just wrong: the history of science is in considerable part about learning to see better.

Seriously study the outliers: It's not controversial to say it's a good idea to support potentially Nobel Prizewinning research! But when people study research outcomes they often focus on the bulk of the curve, i.e., more or less typical outcomes, not extreme outliers. This was true in many of the examples we mentioned earlier, looking at statistics of team size, age distributions, citation aggregates, and so on. It's a kind of data-based approach, aimed at forming an overall picture of what's going on. But there's an implicit tension lurking: there is no a priori reason that what is good for the bulk of the curve is also good for outlier outcomes. Indeed, improving typical outcomes by better controlling the process may suppress exactly the kind of wild variation underlying outliers. We've discussed this point in detail in a prior essay126, which goes more deeply into some associated problems127 than we shall here. A nice shorthand for thinking about it is the following diagram:

As a specific example, we noted above that AGZM used as a metric the increase in total publication output, as well as the increase in top-1% cited papers. While top 1% is certainly good, most such papers are still in the bulk of the curve, they're not actual outliers. Implicit in such an evaluation is the assumption that you want to improve the bulk of the curve. But if outliers actually dominate discovery, then they should be the core of what we try to evaluate and support. Improving the bulk may ignore what is most important and lead to erroneous or even counterproductive conclusions.
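To make this tension concrete, here is a toy sketch in Python. It is an illustration under invented assumptions, not an analysis of AGZM or of any real program: it assumes the impact of a piece of work is lognormally distributed, and the two parameter settings (a tightly "controlled" process and a "wild" one) are hypothetical numbers chosen only to show that the effect is possible. Under these assumptions the controlled process looks better at the median and at the top 1%, while the wild process is better at the 1-in-10,000 level.

```python
import math
from statistics import NormalDist


def lognormal_quantile(mu, sigma, q):
    """Quantile q (0 < q < 1) of a lognormal with log-mean mu and log-sd sigma."""
    z = NormalDist().inv_cdf(q)  # standard normal quantile
    return math.exp(mu + sigma * z)


# Hypothetical parameters, chosen only for illustration.
CONTROLLED = {"mu": 1.5, "sigma": 0.5}  # higher typical impact, little variation
WILD = {"mu": 0.0, "sigma": 1.0}        # lower typical impact, heavy tail


def summarize(process):
    """Impact at the median, the top 1%, and the top 1-in-10,000 level."""
    mu, sigma = process["mu"], process["sigma"]
    return {
        "median": lognormal_quantile(mu, sigma, 0.5),
        "top_1_percent": lognormal_quantile(mu, sigma, 0.99),
        "top_1_in_10000": lognormal_quantile(mu, sigma, 0.9999),
    }


if __name__ == "__main__":
    for name, process in (("controlled", CONTROLLED), ("wild", WILD)):
        print(name, summarize(process))
```

In this toy model, an evaluation that stopped at median output or top-1% output would prefer the controlled process, even though the wild process produces the larger extreme outliers. Whether real research programs behave this way is an empirical question the sketch cannot answer.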

Now, with that being said, we also believe it is valuable to consider such bulk measures. But the picture formed will be deeply incomplete unless the outliers are also identified and studied. So any serious evaluation program must also systematically identify the most important outliers and compare those. In other words, serious comparisons of different approaches to discovery should both examine the bulk of the curve and identify and compare outlier case studies. To do only one and not the other is, in our opinion, a potentially misleading mistake. This contrasts with a common point of view, which treats large data-based studies as somehow intrinsically "more serious" than a few case studies about outliers. But this is a case where anecdotes may trump data. At the very least, such case studies must be considered as first-class participants in the evaluation process, alongside more data-based approaches. It is only by taking both seriously that it is possible to see whether and when there are tensions between the bulk and outliers128.

One challenge in comparing outliers from different programs is that the size of the program matters a great deal. A program with a $10 billion budget should be expected to have many more extreme outlier outcomes than a program with a $10 million budget. This makes comparing outliers tricky. For instance, it's popular to laud Bell Labs for their many Nobel-worthy discoveries. It's true, they were exceptional. But they also had a research budget on the order of many hundreds of millions of dollars a year, for decades (2022 dollars)129. They should have done very well at that scale! One plausible way of making a fair comparison to a much smaller research laboratory is to compare randomly chosen but equally sized subsets of the research outputs. (It would be a good research problem to make that comparison fairly!) An illustrative example close to our hearts is the Dynamicland laboratory in the San Francisco Bay Area. This is an exceptional and very unusual independent laboratory. But in part because it is so unusual it has sometimes struggled for funding. We've often extolled its virtues to potential funders. Some of them – usually wealthy people in tech – have made the observation that Dynamicland's output doesn't yet appear as impressive as the very best outcomes from Bell Labs – things like information theory and the transistor. This is true. However, Bell Labs had two to three orders of magnitude more funding in any given year, and existed over the better part of a century, not less than a decade. It would be shocking if Dynamicland could compare on outliers. But we expect that if you compared Dynamicland to a randomly chosen Dynamicland-sized slice of Bell Labs the comparison would be extremely favorable to Dynamicland. This point is illustrated in more quantitative detail in an Appendix to the essay130.
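The size-matched comparison suggested above can be sketched in code. This is only an illustration, and it rests on a strong assumption that the essay itself questions elsewhere: that each research output can be given a numerical importance score. The function name `size_matched_share` and its parameters are hypothetical, introduced here for the sketch.

```python
import random


def size_matched_share(large_scores, small_scores, n_resamples=10_000, seed=0):
    """Share of random, equally sized slices of the large program whose best
    outcome falls below the small program's best outcome.

    Both arguments are lists of (assumed) numerical importance scores,
    one score per research output.
    """
    if len(small_scores) > len(large_scores):
        raise ValueError("the small program must not have more outputs than the large one")
    rng = random.Random(seed)
    small_best = max(small_scores)
    below = 0
    for _ in range(n_resamples):
        # Draw a slice of the large program the same size as the small program.
        random_slice = rng.sample(large_scores, len(small_scores))
        below += max(random_slice) < small_best
    return below / n_resamples
```

A value near 1 would mean that almost every equally sized slice of the large program has a weaker best outcome than the small program, even when the large program's overall best outcome is far stronger. Comparing only the single best outcome is one choice among many; a fuller treatment might compare several top outcomes or whole distributions.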

One-and-a-half cheers for citations: It's a cliche in metascience that "citations-are-limited-or-bad-and-we-have-qualms-about-using-them… but we're going to use them because ¯\_(ツ)_/¯". A venerable example comes from Eugene Garfield, the creator of the Science Citation Index, who enumerated many thoughtful criticisms of citation analysis and commented that131: "none of the criticisms are unfounded. Most of them are based on facets of citation analysis that pose either theoretical or real problems in using the technique to evaluate people. Those using citation data to evaluate research performance at any level, but particularly at the level of individuals, must understand both its subtleties and its limitations."

But despite such hand-wringing about the limitations of citation analysis, it is extremely widely used. This is illustrated by our prototypical example, AGZM, which relies on ad hoc citation-based measures to quantify scientific merit. It's not the only thing they use, but it is their primary metric.

It's easy to understand why citation analysis is so popular. Citation data is relatively easily available. Citation analysis is easy to apply across disciplines, producing clear "results" with a pleasing sheen of quantitative rigor and respectability. You can scale the analysis up, all while not needing to understand anything much about the underlying ideas in the scientific literature itself. And you can write lots of papers with relatively simple scripts.

All told, this kind of citation analysis seems to have many of the same benefits as choosing a basketball team's players by simply trying to maximize the total height of the players. It's not entirely wrong. But it's also not taking the problem seriously, and will certainly mislead. Citations have no intrinsic connection to scientific progress at all; they are closer to exhaust from progress, not progress itself. Proposals to understand the evolution of ideas in detail are often dismissed as "not scalable". But it seems strange to us to prioritize scalability as the binding constraint over understanding what's actually going on. If you're going to learn to do something well at scale, it makes sense to first learn to do it well locally, even if that approach is not obviously scalable132. Fortunately, we believe a complementary approach is possible:

Two-and-a-half cheers for the history of science: By contrast to citation analysis, there is already a field that studies the evolution, influence, and importance of ideas in science: the history of science. A good illustrative example, one of many, is the work of Abraham Pais. An outstanding theoretical physicist as a young man, in his 50s he switched to become a full-time historian, writing excellent biographies of Albert Einstein and Niels Bohr, as well as a history of particle physics. As just one example, consider Pais's beautiful biography of Einstein, "Subtle is the Lord"133. What you learn from Pais is how to think about Einstein's work. Why did it matter? Where did the ideas come from? How did they change later thinking? How did they change the course of science, and of human civilization? He traces the lineage of thought, through letters, papers, books, and other people's accounts. In this way, we come to appreciate both Einstein's changing internal understanding, and also how that fits into the collective mind and intellectual landscape of the time. By contrast, citation analysis seems almost like (a caricature of) behaviorist psychology, studying the external forms of science, but not in any depth the evolution of the intrinsic underlying ideas. It is only by doing the latter that we can properly grasp the changed understanding. That is the remit of history of science, and also (to some extent) of the sociology and philosophy of science.

Now, of course, the history of science is a very active field. But, so far as we know, its approach is not usually used as the basis for program comparisons of the type we are discussing. And it's easy to see why: it's neither scalable nor superficially "objective" in the way citation analysis appears to be. But we believe it is far more reliable, getting at the actual importance of ideas. It's especially preferable for understanding the importance of individual outlier discoveries. And, as we said above: if you're going to learn to do something well at scale, it makes sense to first learn to do it well locally, even if that approach is not obviously scalable. And so we believe that the techniques of the history of science should take on a central role in making program comparisons, especially in understanding the outliers. They can and should be placed at the foundation of scalable analyses. Citation analysis is also valuable, but we believe it should be secondary.

Of course, a major challenge is that the history of science has the benefit of long hindsight. Pais was able to look back and understand what happened decades earlier, combining both serious historical research with a deep understanding of the underlying ideas134. But even with that depth of understanding it's harder to do this nearer to real time, when the dust hasn't yet settled. This is certainly a challenge, and maybe even an intrinsic problem: part of what makes great scientists great is that they have superb taste (or, in some cases, luck) in what they choose to work on. It may be intrinsically difficult to scale this kind of taste. Still, we believe this approach is worth making a central plank in any approach to program evaluation.

Three cheers for structural diversity: As a general principle for evaluation, we believe there should be a strong presumption favoring structural diversity in social processes. Such structural diversity is a core, precious resource for science, a resource enlarging the range of problems humanity can successfully attack. The reason is that different ambient environments enable different kinds of work. Problems easily soluble in one environment may be near insoluble in another; and vice versa. Indeed, often we don't a priori know what environment would best enable an attack on an important problem. Thus it's important to ensure many very different environments are available, and to enable scientists to liquidly move between them135. In this view, structural diversity is a resource to be fostered and protected, not homogenized away for bureaucratic convenience, or the false god of efficiency. What we need is a diverse range of very different environments, expressing a wide range of powerful ideas about how to support discovery. In some sense, the range of available environments is a reflection of our collective metascientific intelligence. And monoculture is the enemy of creative work.

Structural diversity is sometimes opposed. We've heard people describe instances of it as "too chaotic" or "too confusing". When startup initiatives are announced they often attract far greater critical scrutiny than incumbents. This is a recurring pattern: from Janelia Farm to the Thiel Fellowship to the Santa Fe Institute, and many others. We've been repeatedly told it's "common knowledge" that these are underperforming or even failing. Yet upon closer scrutiny it often seems that such initiatives are outstanding successes, with negative impressions not driven by serious evaluation.

An instructive example is the Thiel Fellowship. Begun in 2010, this is a program by which talented people of about age 20 are offered $100k to drop out of university and pursue ambitious independent projects. The total cost of the program is a few million dollars per year, equivalent to a handful of NIH R01 grants. Critical scrutiny has included negative articles about the Fellowship in The New York Times, The Wall Street Journal, The LA Times, and other prominent sources. Former Harvard President Lawrence Summers described it as: "the single most misdirected bit of philanthropy in this decade". And yet it has supported foundational work in cryptoeconomics, artificial intelligence, and a wide variety of startups. We believe any reasonable evaluation would show it to be an extraordinary success, if compared dollar for dollar with most other programs. Certainly, the Thiel Fellowship's budget is a rounding error on the scale of Harvard and similar institutions.

One way to think about it is this: if you donate $10 million to Harvard or a similar incumbent, then a reasonable rough model is that that money will go to the best thing Harvard didn't already fund. But Harvard has an enormous budget. And so it will fund something far down Harvard's priority list: something Harvard thought wasn't good enough to fund with its first several billion dollars in spending. By contrast, startup ventures striking out in new directions are doing the best they can in those directions. Diminishing marginal returns haven't yet set in. Structural diversity may be chaotic and confusing, but it also offers a chance to escape the tyranny of diminishing marginal returns, and to enable new kinds of creative work. We discuss some of the detailed challenges of achieving structural diversity in an Appendix136.

A metascience accelerator to drive change

Behind all this you can imagine a metascience accelerator, either a new organization or a strongly empowered part of an existing organization, taking a profusion of ideas through the metascience learning loop, over and over:

Each time through the core loop would provide more information. The accelerator would provide support to scale trial programs that were succeeding. It would publish detailed post mortems not just of successes, but of failures. The accelerator would provide mentors who deeply understood the key theoretical ideas of metascience, and also provide practical support for metascience entrepreneurs. Such support would assist in: tool-building, marketing, branding; in fairly evaluating programs, aiming to achieve decisive new metascience results; in instigating crises as a way of achieving change; and aggressively promoting improved new approaches, and attacking poorly performing incumbents. It would be tempting for the accelerator to avoid owning failures: consider the example of the Gates Foundation, which solicited an RCT to evaluate the impact of an educational initiative; and then tried to bury the results when they didn't like them137. But it is only in candidly admitting failure that the community can actually learn; in this, there is credibility. Of course, there is a tension between managing funder relations (which often requires writing laudatory self-reports), and actually improving the way you do science. The latter must take precedence over the former for such an accelerator to be anything more than an empty exercise.

Important points not otherwise covered

Let us conclude Part 2 with a brief discussion of three issues that are important for metascience, but which we have ignored in the core of our treatment.

Using "metascience" to unintentionally make things worse: There's a disastrous future possible where "metascience" is used by bureaucracies as a lever to justify ever-increased control. "Let's measure and improve things!" is the bureaucrat's sincerely well-meaning war cry. It is understandable that the people running funders and research institutions – often brilliant, imaginative people – will think they know just the right way to improve things. Yet the usual de facto outcome of such efforts is to centralize decision-making power, and so to suppress much of the messy, illegible potential latent in the community of scientists. We fear a future in which armies of metascientists go forth, suggesting "improvements" which move more power to the center, and impose an ever larger burden on scientists. One sees this already in some misuses of scientometrics as a tool of measurement and control. Capital and centralized authority may determine much about how people behave, but they have no influence whatsoever over how Nature behaves. The correct response is to impose an expectation of metascientific humility on incumbent organizations; and to hold to especially high standards of evidence any proposed "improvement" which increases the power of incumbents or the burden on scientists.

Contrasting metascience with macroscience: There's a point of view in which what matters for science is government spending on research as a fraction of GDP; the number of PhDs a country produces; number of papers; number of citations, and so on. These, in turn, are purportedly related to measures such as GDP and productivity growth. Nations do "capacity building" in "priority areas". It's science-by-the-barrel, viewing it as a commodity to be bought and sold. This macroscopic view is very different from the view we've developed, which focuses on understanding mechanisms of action, and improving specific social processes and culture. By contrast, the macroscopic view does not usually have strong specific mechanisms of action, other than "more science is good", and contingent ad hoc assumptions (e.g.: "AI is important, we should invest more in it"). To some extent the two views are complementary. But we also believe that over the long term most macroscopic value is likely to be a consequence of improved social processes and culture. It is such improvements which can produce qualitative changes inaccessible through merely "more science".

Metascience doesn't determine what we value: There are central questions metascience can't answer. Consider a question like how fast NSF versus NIH versus DARPA grows. Or: how to prioritize efforts at diversity, equity, and inclusion. These aren't purely metascientific questions, but are about what human beings think is important and valuable. They necessarily and correctly involve political priorities, which reflect the values and judgments of stakeholders. Metascience can help us understand how better to achieve particular desired ends, but cannot on its own determine what those ends ought to be. The distinction is related to Hume's is-ought distinction. You can, for example, embark on a metascientific program of understanding how to design and achieve particular risk profiles. As those principles and design ideas are developed, they can be deployed in service of better achieving the goals suggested by our values and political priorities. But they won't determine those values and political priorities! We've ignored these issues through the main body of the essay. We expect that, as a practical matter, many of the ideas of metascience can be developed independently of values and stakeholder interests. But ultimately all must play a role.

Is transformative improvement possible in the culture of science?

This present moment used to be the unimaginable future. – Stewart Brand

We've just written 30,000 words about improving the culture of science. To what end? Even if we could rapidly change science's culture, maybe today's discovery ecosystem is close to the best possible? Can we only make tiny incremental improvements, or are truly transformative improvements possible – changes that would surprise or shock scientists of previous generations with their novelty? Of course, this question is beyond the scope of what we can discuss with confidence. But we cannot resist a few speculative thoughts.

One reasonable conjecture is that today's ecosystem is close to the best possible. In this view, any kvetching you hear is either people looking for minor improvements, or people who are mistaken or expressing sour grapes. It's easy to find superficial support for this position. As we noted in the introduction the best scientists today are doing astounding work. And if you talk with such people it is natural to wonder: "is it possible to do better than the way this amazing person is already operating? Surely all that's needed is to scale up the organizations that produce such people, and to improve the way we support them?"

But there is a hole in such an argument. Even if you accept that certain types of work are superbly supported by today's system, that doesn't mean there aren't bottlenecks preventing other crucial types of work. In this view, there are researchers with particular canonical working styles – perhaps a Robert Langer or George Church138 – which accord well with what universities are oriented to support. In those cases, they have large labs, many grad students and postdocs, a stream of papers and grants (and overhead) in the vanguard of mainstream fields, a combination of salespersonship and management and leadership and science. This is terrific and we're glad universities support it. But perhaps there are other not-currently-canonical working styles – that of a Katalin Kariko or Stephen Wiesner or Douglas Prasher, for instance – which aren't so congenial to today's institutions. And perhaps those working styles enable scientific work which is crucial, but near-impossible to do within the existing ecosystem. That is, you can stack up as many canonical researchers as you like and they still won't do the non-canonical work; it's a bottleneck on our capacity for discovery.

In this view, a benefit of new social processes and new cultures is that they can enable new types of work not easily possible in existing environments. And, furthermore, those new types of work may play an irreplaceable role in furthering science. As just one example, we've suggested increasing the rate at which new fields are founded. Implicit in this suggestion are three ideas: (1) that field-founding often requires types of work which are difficult to support within today's discovery ecosystem; (2) those new types of work can be enabled by new types of environment; and (3) science is currently bottlenecked on field-founding, and so enabling field-founding work would unlock tremendous latent potential for discovery. In this way, creating a new cultural niche can have a transformative effect on science. Analogous arguments can be made for many of the other ideas sketched earlier.

Of course, new environments and new social processes don't just change what types of work are immediately possible. Over the long term they may transform who goes into science; how they grow and change over the course of their lives; the collective norms and shared assumptions and tools they have available to them; and the structure of the networks that carry expertise and problems and resources to individuals. This in turn changes the problem flux they're exposed to, the types of people they talk to, the resources and expertise available, and the values and incentives shaping their growth. Indeed, it potentially changes the entire cognitive and emotional experience of being a scientist. It is those longer-term changes which can enable the "same" person to do radically different types of work, including work very difficult to do today. This is why we believe cultural shifts may completely transform science.

It's worth contrasting this to a common romantic ideal, memorably expressed by Vannevar Bush, that science is best done through "the free play of free intellects". Or, as many people have said to us: "Aren't top people at top places able to do as they want? Why not just do more of that? Let's minimize red tape, reduce the amount of administration and grant-writing, and get out of the way!" Versions of these sentiments are also sometimes expressed in the what's-wrong-with-science-and-how-to-fix-it genre. There is merit in these ideas, but resource constraints mean the question "whose intellect gets to play freely" becomes omnipresent. After all, the phrase originates in the document credited as the origin of the US National Science Foundation! So while "the free play of free intellects" seems attractive, it is not usually attainable. But more importantly, such ideas fail to address the crucial role of culture and environment. It's an individualist philosophy, while discovery is fundamentally a networked endeavor139. Changed culture doesn't matter because it eliminates inconveniences in scientists' lives. It matters because it transforms scientists themselves, and the structure of their networks, and so transforms and enlarges the kinds of work that can be done140.

In this essay we've attempted to peer into and shape the future. It's a sketch, and the ideas and language need development. It needs to be improved by experience. And it's possible to make a case against this vision. The social processes of science have incredible inertia: it's like turning a supertanker around. It's easy to conclude the situation is intractable. Still, there is value in persistence: the culture and social processes of science are jewels of human civilization. We have an opportunity to make them rapidly self-improving, and to develop metascience into an engine driving improvement in the way humanity understands the world. It is humanity's ability to improve the discovery ecosystem that will ultimately determine the long-run health of science.


MN's work was supported by the Astera Institute. KQ's work was supported by Generally Intelligent. Our thinking has been formed in conversation with hundreds of people over several decades. Our thanks to them all. Especial thanks to: Scott Aaronson, Dorit Aharonov, Dave Albert, Josh Albrecht, Sam Altman, Marc Andreessen, Nadia Asparouhova, Pierre Azoulay, Nick Beckstead, Juan Benet, Alexander Berger, Nick Bloom, Ed Boyden, Adam Brown, Stuart Buck, Howard Burton, Carl Caves, David Chapman, Jennifer Chayes, Seemay Chou, Ike Chuang, Matt Clancy, Patrick Collison, Tyler Cowen, Dom Cummings, Laura Deming, David Deutsch, Artur Ekert, Chris Fuchs, Julia Galef, Anastasia Gamick, Danny Goroff, Ilan Gur, Gwern, Melissa Hagemann, Celine Halioua, Timo Hannay, Robin Hanson, Demis Hassabis, Sabine Hossenfelder, Anton Howes, Elanor Huntington, Tim Hwang, Heather Joseph, Tom Kalil, Alan Kay, Julia Kempe, Ottoline Leyser, Adam Marblestone, Andy Matuschak, Jed McCaleb, Gerard Milburn, Evan Miyazono, Cameron Neylon, Brian Nosek, Chris Olah, Catherine Olsson, Tim O'Reilly, James Phillips, Ben Reinhardt, José Luis Ricón, Gerry Rubin, Halina Rubinsztein-Dunlop, Terry Rudolph, Grant Sanderson, Ben Schumacher, David Siddle, Star Simpson, Lee Smolin, Rob Spekkens, David Spergel, Peter Suber, Umesh Vazirani, Bret Victor, Marc Warner, Mike Webb, Eric Weinstein, Andrew White, John Wilbanks, and Heidi Williams. For feedback on drafts of the essay, we thank: Kat Baney, David Chapman, David Lang, Evan Miyazono, Brian Nosek, and Janelle Tam. Especial thanks to: Patrick Collison, with whom MN collaborated on a precursor project; to Grant Sanderson, for initial encouragement to undertake this project; to David Chapman for help and encouragement in overcoming a crucial difficulty, and for many generous conversations; and to Brian Nosek for conversations about the origin of the Center for Open Science. 
The icons for navigation were created by Freepik-Flaticon, and the table of contents benefits from code supplied by Michael Keenan, and adapted from Jim Babcock. Finally, thanks to the wonderful, maddening, collective genius and collective boor that is Twitter – simultaneously history's most able, most irritating, and most distracting research assistant and research stimulant.

Citation information

For attribution in academic contexts, please cite this work as:

Michael Nielsen and Kanjun Qiu, "A Vision of Metascience: An Engine of Improvement for the Social Processes of Science", https://scienceplusplus.org/metascience/index.html, San Francisco (2022).


Other opportunities to transform science: artificial intelligence, India, China, space colonization, and intelligence augmentation

There are several external factors we've ignored through this essay, which may eventually be highly relevant to the future progress of science. We'll now briefly turn to those factors, and how they relate to metascience.

Perhaps the most fashionable is artificial intelligence (AI). Some well-informed people confidently believe AI will soon transform science (and the world), while others believe it's no big deal. The recent success of DeepMind on protein structure prediction is impressive: we humans are now learning a great deal about proteins from our machines. And while that's "merely" one (important) problem, one wonders: will this kind of breakthrough become routine? Are AI systems going to drive more and more progress in science? Will they drive improvements in themselves141? Might they eventually transform the engine of scientific discovery? Ilya Sutskever, Chief Scientist at the AI company OpenAI, tweeted: "In the future, it will be obvious that the sole purpose of science was to build AGI [artificial general intelligence]". The statement is absurdly over-the-top – the billions of people whose lives have been improved by pre-AGI science could not possibly agree – and was perhaps meant merely as marketing for his company. But it may have a grain of truth.

For our purposes, the point is that if AI does completely transform science, then modifying the social processes of science today may matter little for the long run. Of course, this argument can be used in favor of not thinking about anything except AI in the medium-term future. It's AI-as-thought-terminating-cliche. We know some people with this view. But we believe there's enough uncertainty about this that improving the social processes of science remains a question of considerable interest.

AI isn't the only event which might plausibly cause a new scientific revolution. India and China are rapidly developing their parts of the overall discovery ecosystem. In so doing, they may decide to mostly copy the institutional approaches of the US and Europe. But we hope they will seize the opportunity to be different and (perhaps) vastly better than extant institutions. Doing so will require imagination and insight and courage; we hope metascience will contribute useful ideas. It's a tremendous opportunity for those countries and, we hope, for humanity142.

Another potentially transformative opportunity is humanity's colonization of space. This is similar to the opportunity for science afforded by India and China, but ultimately far larger (and more challenging). The colonization of space is still just beginning, but seems likely to gain pace over the next century and in the centuries that follow. It will provide enormous challenges, and also incredible opportunities to trial new institutional ideas. Again: we hope our descendants will have the courage and creativity to reimagine our institutions. Given the courage and imagination it will take to colonize space, that seems likely! We hope and expect that a flourishing field of metascience will contribute to any such reimagination.

Finally, there is the possibility of new technologies for intelligence augmentation. Humanity's existing technologies for intelligence augmentation – language, writing, the alphabet, mathematics, the printing press, early computers – have each transformed how human beings think and discover. Will we develop new technologies for intelligence augmentation that enable further such transformations143? As with AI it's difficult to say how any such transformation would relate to metascience. But insofar as metascience is able to work toward universal truths about discovery it will inform the future of science no matter what.

The trouble caused by scale

As we noted in the body of the essay a major challenge in evaluating and comparing novel social processes is that the scale of the trial may matter in ways that are hard to reason about. In particular, outlier-dominated processes can show surprising relationships between the scale at which the process is trialled and the outcome. This can, for example, lead you to falsely conclude that a trial process performs worse than an incumbent, simply because the incumbent has much more scale.

To understand what's going on, it helps to think about the (non outlier dominated) world we usually live in. If you put ten times as much fuel in your car, you expect it to go about ten times further. If you plant ten times as much of a crop, you expect about ten times as much of a harvest. Of course, in both cases there are exceptions – maybe some of the larger land area isn't very good, causing part of the crop to fail. But the basic intuition is right, and it's relatively straightforward to reason about variations.

This has consequences when you make comparisons. Suppose you are comparing the fuel efficiency of two cars, and are told that one of the cars will get 100 miles on 5 gallons of gas, while the other car will get 1200 miles on 50 gallons of gas. Obviously, you'd conclude the second car was more fuel efficient! The strange thing about outlier-dominated processes is that this reasoning can utterly fail to hold, and fail to hold in ways that are difficult to reason about.

There's a nice toy model that can be used to illustrate this. The model is not realistic – that's not its point. Rather, the point is to illustrate a tricky issue that we believe holds more generally. The purpose of this Appendix is to lay out the toy model, to help readers build intuition.

Suppose we have two approaches to science funding, which we'll call C, the control (representing an existing approach), and I (the intervention, representing a new approach). The basic idea is to suppose we have some way of quantifying the importance of discoveries, and both C and I have heavy tails under this measure of importance, but: the intervention is much heavier tailed (i.e., is more likely to produce enormous outliers), while on typical outcomes the intervention performs a little worse than the control. Qualitatively, the idea is that the distributions may be depicted as:

If the intervention is worse in typical behavior, then for a small trial the intervention I will appear worse than the control C, since a much smaller number of samples means I has much less opportunity to benefit from the heavy-tailedness. It's not that the existing approach C is actually better. It's rather that it has tremendous scale, which means it benefits from occasional outlier outcomes. By contrast, the intervention I has relatively few opportunities to benefit; if it had similar scale, its outliers would be more pronounced than those of C. Put another way: in heavy-tailed systems the number of "shots on goal" you get matters a lot.

Of course, scientific discoveries aren't fungible, and can't be quantified in this way. But this is nonetheless useful as a toy model or intuition pump to illustrate the underlying issue. In particular, it's easy to construct explicit examples of distributions with the properties just described. One example: C is a log-normal distribution with \mu = 0 and \sigma = 1.4, and I is a power-law distribution with shape parameter \alpha = 2.3 and with minimal value 0. We did repeated computer simulations144, drawing 100 times from the intervention, and 100,000 times from the control. When we did 1,000 such simulations, we found that roughly 63 percent of the time: (1) the median of the intervention was lower than that of the control; (2) the average value of the trial intervention was also lower than that of the control; but (3) if the trial intervention was extended to have the same number of samples as the control, it actually had a higher average value, with the average being substantially shifted due to more opportunities for outliers. Moreover, the shift was often quite significant, with an average 39 percent increase when the trial was extended out. The scale really matters, enough so that a comparison would likely be strongly misleading without taking scale into account. How best to do that we leave as an open problem.
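The three-part check described above can be sketched in code. What follows is a minimal sketch under stated assumptions, not the simulation used for the essay: the function names, the sampler interface, and the seed handling are all ours, and it uses only Python's standard library. A small trial is modelled as the first `n_trial` draws of a longer run of `n_full` draws from the intervention. The function counts how often the trial looks worse than the control on both median and mean while the full-scale intervention has the higher mean. The distribution parameters in the usage lines are placeholders, not the essay's values, and reading "minimal value 0" as a Pareto distribution shifted to start at 0 is our assumption.

```python
import random
import statistics


def misleading_trial_fraction(sample_control, sample_intervention,
                              n_sims, n_trial, n_full, seed=0):
    """Fraction of simulations in which a small trial of the intervention
    looks worse than the control (lower median AND lower mean), although
    the intervention run at full scale has the higher mean.

    sample_control / sample_intervention are hypothetical callables taking
    (rng, n) and returning a list of n 'importance' values.
    """
    if not 0 < n_trial <= n_full:
        raise ValueError("need 0 < n_trial <= n_full")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        control = sample_control(rng, n_full)
        intervention = sample_intervention(rng, n_full)
        trial = intervention[:n_trial]  # small trial = prefix of the full run
        control_mean = statistics.fmean(control)
        looks_worse = (statistics.median(trial) < statistics.median(control)
                       and statistics.fmean(trial) < control_mean)
        better_at_scale = statistics.fmean(intervention) > control_mean
        if looks_worse and better_at_scale:
            hits += 1
    return hits / n_sims


def lognormal_control(rng, n):
    # Placeholder parameters (an assumption, not the essay's values).
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]


def shifted_pareto_intervention(rng, n):
    # paretovariate has minimum 1; subtracting 1 shifts the minimum to 0.
    # The shape 1.1 is a placeholder chosen only to give a very heavy tail.
    return [rng.paretovariate(1.1) - 1.0 for _ in range(n)]


# Deliberately small sizes so that the pure-Python loop runs quickly.
fraction = misleading_trial_fraction(lognormal_control,
                                     shifted_pareto_intervention,
                                     n_sims=20, n_trial=100, n_full=5_000)
```

With the sizes quoted in the text (100 and 100,000 draws, over 1,000 simulations) a pure-Python run is slow, and a vectorized numerical library would be the natural choice. The fraction obtained depends strongly on the tail parameters and on how the power law is parameterized, so it should not be expected to match the figures above.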

Of course, this toy model is contrived and artificial. But the fundamental point is clear. Indeed, it even seems plausible as a description of certain historic situations, like that of Bell Labs, as discussed in the body of the essay. As mentioned there, the idealization of Bell Labs is typically based on a few very important outcomes: the transistor, information theory, theory of superconductivity, and a few others. But Bell Labs was enormous, and it's difficult to untangle those successes from the fact that its scale meant it was able to take a lot of shots on goal. Some of those shots looked like Shannon producing information theory. But most are justifiably forgotten. And this makes it tricky to compare Bell Labs to much smaller operations.

Challenges in achieving structural diversity

In the body of the essay we've advocated for structural diversity in scientific culture. There are many associated challenges, and in this Appendix we describe a few. Most broadly, there's the portfolio construction problem: what regulating mechanisms should set the portfolio of available environments?

The portfolio construction problem has many subproblems. One is the unbounded growth problem. To illustrate this problem, let's return to the example of Focused Research Organizations (FROs). Adam Marblestone, the co-creator of FROs, has told us that he frequently hears the critique that "not everything should be a FRO". He sometimes hears this even after he repeatedly emphasizes that "not everything should be a FRO". It's easy to see why people worry. FROs have just recently been launched, and it's possible the first few FROs will go poorly, and the idea may sink. But it's also possible that one of the first FROs will be a spectacular success. All of a sudden there may be many laudatory articles about the model. Other funders may start to launch FRO-like programs. That may lead to still more articles, and more FRO-like programs. In such a situation: what sets the final scale? Do funders become FRO-happy? FROs hit many satisfying spots for them: they're highly legible, goal-oriented, measurable, with good stories and optics. All things funders love; FROs are, in many ways, a too-natural thing for funders to want to scale! Perhaps we reach a state where some scientists begin to complain that everything is FROs, FROs, FROs, that's the only way to get anything funded. What was healthy growth becomes malignancy. Or do we instead reach a situation where FROs are one component in a diverse heteroculture, used when appropriate, but without other work being unnaturally shoehorned into the model? What sets the scale?

The same story can be repeated for many ideas for new social processes. We very much like Registered Reports, but it would be a bad mistake if they became compulsory. More exploratory and speculative work is also extremely important. Ideas like fund-by-variance and fund-by-lottery likely have significant advantages over conventional approaches; but they shouldn't replace them entirely. The trouble in all these cases is the lack of good self-regulating mechanisms. It's easy to end up with research monocultures, when it would be far healthier to have a diverse portfolio of approaches co-existing, with the relative scale regulated by some mechanism ensuring the ongoing health of science.

The fundamental question is: what sets the scale for any such program? Is it fashion and fickle perception and politics and highly legible stories? Or is it the scientific contribution, and the value of the program as a part of the wider ecosystem? At present, it is almost entirely the former, and only incidentally the latter. Of course, fashion and fickle perception and politics often do not present as such. They present as enthusiastic articles by well-meaning people in Nature and Science and The New York Times. They present as enthusiastic and brilliant founders of new institutions, with their pet ideas about how things could be better. This is excellent for generating new social processes, but a terrible basis for evaluation145. The net result is a natural monoculture, an oligarchy in which there's either far too much or not nearly enough of something. This is a disaster for science.

Unfortunately, we don't have good answers for these questions! Just to summarize the problems as we see them: there's the portfolio construction problem, and three related subproblems. (1) The unbounded growth problem: finding a healthy regulator for the ultimate scale of a process; (2) The metascience alignment problem: whether the ultimate scale of a process is set by the value to humanity and science, or by fashion and politics; (3) A problem we haven't yet explicitly mentioned, the extinction problem: how to scale down processes which are working poorly, a kind of creative destruction for science. This is something science currently does extremely poorly; the result is long-lived institutions and communities and processes which crowd out new entrants146. It's possible there are no perfect or near-perfect answers to any of these problems. But we believe far better answers are possible than we have today, through imaginative theoretical work and mechanism design. In the meantime, even without solving these global problems of portfolio construction, it seems worth focusing more locally, on identifying new processes that deserve to be amplified.


  1. We'll use the term discovery ecosystem to denote the set of organizations and social processes that engage in basic science. It includes much work done by universities, national laboratories, some industrial laboratories, and many independent research institutes, thinktanks, and independent researchers. In speaking with academics we've noticed they sometimes assume that nearly all noteworthy basic science is done within universities. This is, however, not even close to true. We use the term "ecosystem" to help suggest that many of these organizations wink in and out of existence, or change form, and that there are familial similarities between many of them. In general, we've found evolutionary metaphors helpful in thinking about this landscape.↩︎

  2. A partial but useful catalog of new research organizations has been collected by Sam Arbesman, the Overedge Catalog. See also: Nadia Asparouhova, "Understanding science funding in tech, 2011-2021" (2022).↩︎

  3. Incidentally, here and throughout we accept this dichotomy between funders and research institutions. It is a false dichotomy (though a useful present-day approximation), making certain assumptions about the way in which labor should be divided. It's easy to come up with example mechanisms which fall outside this dichotomy: an obvious one is distributed mechanisms for the allocation of resources, such as have been proposed by the Decentralized Science (DeSci) community, with ideas like Decentralized Autonomous Organizations (DAOs). And, of course, organizations like professional societies and journals also fall outside the dichotomy. Instinctively, we feel that the metascience accelerator discussed later in the essay really ought to sit outside this dichotomy too: it's a set of ideas that should have multiple instantiations.↩︎

  4. In the footnotes we cite some antecedents we believe the reader may find of especial interest, but our citations are rather ad hoc and incomplete, since our intent in listing these ideas is merely to evoke some of the many possibilities. We apologize to and ask for understanding from those whose work we have unfairly slighted.↩︎

  5. There are many variations of this idea possible. Another one is to assign an Advocate and a Devil's Advocate for each grant. Yet other variations back off from a hard insistence on variance, and instead mix polarization and consensus models.↩︎

  6. This fund-by-variance strategy can be done at other levels of abstraction, not just the level of individual grants. For instance, instead of getting involved in new program areas by consensus, funders could look for program areas which are controversial. One implementation would be to only start a new program area if it is strongly supported by some program managers, and strongly opposed by some program managers. The idea is motivated by the observation that DARPA has often chosen highly unusual programs, which then very nearly force grant applications which would be considered unorthodox by other funders. Similar select-by-variance ideas could be applied to hiring, selecting areas for new departments at universities, even for setting up entirely new institutions.↩︎

  7. There is at least some weak evidence against the idea: Adrian G. Barnett, Scott R. Glisson, and Stephen Gallo, Do funding applications where peer reviewers disagree have higher citations? A cross-sectional study (2018).↩︎

  8. Note that this is a proper noun, and so capitalized, whereas the preceding item, fund-by-variance, is an abstract strategy applicable in many contexts, and so is lower case. This kind of distinction will recur throughout the essay. Where possible we prefer concreteness and capitalization, and rely on the reader's imagination to generalize. However, when it is more informative to work with a higher degree of abstraction we shall do so.↩︎

  9. This term has also been used by Samo Burja in his 2019 essay Intellectual Dark Matter. His focus is on tacit and lost knowledge, and so is rather different than ours, which is focused on hidden ideas with potential to drive scientific discovery. Still, all seem like part of a larger category.↩︎

  10. It's surprisingly difficult to find good data on tenure rates. One useful report is: Michael J. Dooris and Marianne Guidos, "Tenure Achievement Rates at Research Universities" (2006). That report suggests that at many research universities something like 50-60% of tenure track faculty are eventually tenured at that university, with considerable variation. However, much of the deficit is due to faculty moving to jobs elsewhere. The actual tenure rate for people who go up for tenure review is, for the one university where they have data (Penn State), much higher: 89%. All these factors (and more) would need to be taken into account in the design of a serious tenure insurance program. However, for the sketch in the essay, an estimate of 80% seems fine as a ballpark figure.↩︎

  11. As far as we know, the idea of tenure insurance originated in conversation between Patrick Collison and Michael Nielsen.↩︎

  12. The report is: Qualitative Evaluation of Completed Projects Funded by the European Research Council (2017). There are various ways one might try to reconcile the facts we mention. The two best we can think of: (1) The original project ideas really did overwhelmingly fail, but the projects achieved some other extremely valuable outcome; (2) The use of "high-risk" should be understood as a socially-constructed a priori judgment: if describing a project to others in the field, most would say "that sounds very risky", but in fact the recipient was so resourceful (or perhaps lucky) that they succeeded anyway. If the ERC meant either of these things, they failed to note it in their report. We believe a more likely hypothesis is that the review was a marketing document intended to make the ERC look good, not a serious evaluation of risk.↩︎

  13. Of course, there are many problems with this proposal. Most obviously: won't this incentivize program managers to actively select bad projects, to reach their failure quota? It's easy to think of plausible ways this might be avoided. However, in these brief program evocations we won't try to debug the proposals. In a similar way, failed / not failed, and fired / not fired are demarcations which need more thought and refinement. As a practical matter, debugging and iteration will certainly be required, not just of this proposal, but of all of them. Again: these are program evocations, not full proposals.↩︎

  14. Perhaps the other most-cited example from relatively recent times is Bell Labs. As we'll discuss later in the essay, Bell Labs was certainly very good, but its budget was also enormous – at their respective peaks, more than an order of magnitude larger per year than PARC. Bell Labs also persisted as a research lab for far longer. On a per-dollar basis it seems likely to us that PARC was far more successful.↩︎

  15. Incidentally, it's possible to make a case that PARC would have inevitably declined in the 1980s, with key staff exiting for more applied positions at companies such as Microsoft, Apple, and even Atari. The story of the changes that actually happened is told in chapters 24-26 of Michael Hiltzik's book "Dealers of Lightning" (1999). If a severe decline was inevitable, then one might argue that an NSF or similar acquisition would have been a bad idea. That would not, however, establish that such acquisitions are in general a bad idea. The Google acquisition of DeepMind seems like an example of this acquisition strategy succeeding well, at least as judged by research impact.↩︎

  16. It could plausibly be net negative for humanity to move brilliant teenagers from the developing world to richer countries. Certainly, jingoists or the self-interested in richer countries could easily make a superficially plausible case that it would be a net positive for humanity. But that doesn't make it true, and the situation seems to us rather complicated to assess, with many positives and negatives.↩︎

  17. There's a certain type of closed-minded "scientist" who would hate this. These people could, perhaps, be sought out to act as insurance counterparties. Certainly, if they're genuinely outraged at the possibility of (say) a prize for perpetual motion, then they should be willing to accept (say) a $100 premium against the possibility of a $10,000 payout. After all, "it's free money!" is what they (claim to) believe. Put together 10,000 such people, and you have a $100 million prize. It's amusing to imagine a DAO of people pooh-poohing perpetual motion, acting as counterparties for the prize. Though it is impossible, at present, to imagine such people joining a DAO.↩︎

  18. An obvious argument against this is that many such prizes would be dwarfed by the intrinsic commercial incentive. Even a billion dollar prize for cheap cold fusion would (plausibly) be tiny compared to the commercial payout. While this argument is reasonable, it is perhaps less true in practice than one might think. There are many historic instances where the person or organization who creates a technology was not suited to building a successful company around that technology. There's also something about the concrete vividness of such a prize: the founder of the $10 million X-Prize, Peter Diamandis, has estimated that more than $100 million was spent trying to win the prize. That is, despite the superficial rational investment calculus, in practice investors behaved quite differently. It's interesting to ponder the reasons for what seems like overinvestment. In some sense it seems that the investors must have viewed the possibility of a prize (and the publicity and other halo effects) as a kind of subsidy, and on the margin that was enough to change their behavior.↩︎

  19. When we talk with funders about this idea we encounter a deluge of objections. As far as we can tell these are mostly proxies for "we'd be really uncomfortable doing this, and would much prefer to not be publicly accountable for bad decisions in such a clear cut way. Certainly, we don't want individual program managers [etc] to be accountable". While that is perhaps an uncharitable interpretation of their comments, we believe it is likely true, in part because both of us so intensely dislike being held accountable in such a way. It really is deeply unpleasant! Kudos to the partners at Bessemer Venture Partners for their willingness to do this publicly and with such good humor.↩︎

  20. We owe this observation to James Phillips.↩︎

  21. This is attested in part by scientists who participated in the Fast Grants program; of those who received a grant, "78% said that they would change their research program 'a lot' if their existing funding could be spent in an unconstrained fashion." (Patrick Collison, Tyler Cowen, and Patrick Hsu, What We Learned Doing Fast Grants). It is also attested by Thomas Sinkjær, former director of the Danish National Research Foundation, explaining that he and the board of the foundation: "spoke with more than 400 young scientists and kept hearing the same depressing refrain: many were writing grants not for work they really wanted to do, but for projects they thought could get funded. Often, they were not even bringing their best ideas to the table." (Thomas Sinkjær, Fund ideas, not pedigree, to find fresh insight, Nature (2018)).↩︎

  22. Nick Szabo, The Birth of Insurance (2005).↩︎

  23. A reasonable response is: "That's not much good if you're a sailor". Of course, this is an analogy to make a point about systems change, not meant to be directly applied. Nonetheless, it's worth noting that many of the points we discuss amount to ways of de-risking individual sailor / scientist lives – ideas like tenure insurance.↩︎

  24. Speaking very loosely, and with a great deal of hubris, in this essay we aim to do for revolutions in the social processes of science what Kuhn did for scientific ideas in "The Structure of Scientific Revolutions". We amused ourselves while writing with the title "The Structure of Scientific Social Revolutions". Of course, a big difference is that Kuhn was able to analyze an extant situation already designed to support such revolutions in ideas. Science today is not designed to support social revolutions. And so a better title might be: "Structures to Support Scientific Social Revolutions".↩︎

  25. We are using terms like "proto-field" and "field" here, which may be taken to imply that metascience is purely a research field. That is not the case. Rather, as we shall argue later, to be successful metascience must develop and intertwine three elements: an imaginative design practice, an entrepreneurial discipline, and a research field.↩︎

  26. Our discussion is framed in terms of funders, though the model may be easily and fruitfully adapted to other research organizations.↩︎

  27. There are several related but different framings which are also helpful. One that's quite close, but suggests different avenues of investigation, is that this is about search-and-inference. As this model is meant primarily to be a generative design tool, such changes in framing can be very helpful. A more radical change in framing is suggested by Joël de Rosnay's 1979 book The Macroscope. By macroscope, de Rosnay means (roughly) an instrument capable of seeing the big picture of some complex system. It may be thought of by contrast to a microscope: whereas a microscope lets you view the fine detail of a system, a macroscope lets you form an overall view of a complex system. So: funder-as-detector-and-predictor may also usefully be rephrased as funder-as-macroscope-and-predictor.↩︎

  28. This framing owes a lot to: F. A. Hayek, "The Use of Knowledge in Society", The American Economic Review (1945).↩︎

  29. The trend has been widely discussed, and there is a large literature. A stimulating recent paper with many references is: Haochuan Cui, Lingfei Wu, and James A. Evans, Aging Scientists and Slowed Advance (2022).↩︎

  30. Experiments along these lines are being done by the Thiel Fellowship and New Science. In some sense, there are many similar experiments within academic science: things like the Royal Society Fellowships in the UK, for instance, or the Hanna Gray Fellowships from HHMI. Informal conversation suggests that these are often viewed as minting young superstars, who are anticipated to go on to great things. Of course, it's very different to have been granted an endowed professorship rather than these temporary positions.↩︎

  31. An introduction to FROs may be found in: Adam Marblestone, Anastasia Gamick, Tom Kalil, Cheryl Martin, Milan Cvitkovic, and Samuel G. Rodriques, Unblock research bottlenecks with non-profit start-ups (2022).↩︎

  32. It's fun to think of the Century Grant Program as like FROs, but tipped sideways in time. That is, FROs aim to achieve scale in the here-and-now, while the Century Grant Program aims to achieve scale over time.↩︎

  33. A discussion of the origin of the term para-academia may be found in: Alex Wardrop and Debora Withers (eds.), The Para-Academic Handbook (2014). We're using the term in a related, but different sense, simply to mean the act of doing research outside conventional university or institutional (including industrial laboratory) settings.↩︎

  34. A nice recent example of such a program is the Pivot Fellowship recently introduced by the Simons Foundation.↩︎

  35. This, in turn, is a special case of exploring social process whitespace, which is the main theme of Part 1 of this essay, and a major theme of the entire essay.↩︎

  36. It's interesting and instructive to ask scientists for wild things they'd like to do, but can't. A tiny minority of scientists will immediately talk about such things. They may even have a side project working on some aspect of it (a point made to us by Evan Miyazono). But many are simply silent. And some will describe the boringly conventional in response – it's surprising how many will respond with whatever is currently fashionable at the NSF etc. One of the most imaginative scientists we know, a MacArthur Fellow, once told one of us (roughly): "Even very conservative scientists often have wild ideas. But it's often hard to get them to talk about them. They have good reasons to keep quiet. My preferred method is to get them liquored up. It still turns out that even some very good scientists have no wild ideas. But sometimes you're really surprised when someone you thought was conservative has many crazy ideas." Even with all that said: over time, many people become programmed by the system, and learn to see only those things which are possible. Still, it's fun to have these conversations.↩︎

  37. In Christopher Sykes, "No Ordinary Genius" (1994).↩︎

  38. George A. Akerlof, "The Market for 'Lemons': Quality Uncertainty and the Market Mechanism", The Quarterly Journal of Economics (1970).↩︎

  39. Academic scientists sometimes talk about "high risk, high reward" research, but are often themselves extremely rigid and conservative. As an example, after the boutique orange juice company Juicero failed, on social media we saw many people, including many scientists, lambast Juicero and the venture capital ecosystem, claiming it was obviously going to fail. But they were doing so merely with eagle-eyed hindsight: indeed, many of the reasons we saw stated for Juicero's failure applied equally to the boutique coffee company Keurig, which was an enormous success. An appetite for risk means that some things will fail badly; such failures are a natural consequence of correct systemic behavior. (This is, it is worth noting, quite distinct from cases of fraud, such as Theranos.)↩︎

  40. Something we've struggled with in writing this section is depth of description. We're trying to evoke an (enormous) design space, and the idea of what imaginative design means. We're trying to explore broadly, asking many fundamental questions, playfully proposing many ideas, and exploring many heuristics for design. But each of these ideas is really a world unto itself, and needs an extended treatment. In early drafts of the essay we expanded out some of those ideas. Unfortunately, it made the essay impossibly long, and destroyed much of the evocative effect. And so we've gone for evocation at the expense of depth. And that depth really is a sacrifice: to make any of these ideas work would require very considerable depth. We hope you forgive us for this rapid tour of part of the design space, and keep in mind the depth that is really required.↩︎

  41. Paul Dirac, "Directions in Physics" (1978).↩︎

  42. Our experience in talking to current funders is that what they consider "early" is usually very late, often one or several decades after actual early work. Real early-stage work seems to be almost entirely illegible to such funders, a form of intellectual dark matter. For calibration on what we mean: early work on AI and AI safety was no later than the 1940s, prior to Turing's famous 1950 paper; on quantum computing was no later than in the 1980s (or arguably the 1970s, prior to Feynman's famous 1982 paper, but after, e.g., seminal contributions by people such as Wiesner, Bennett, and Holevo); and on nanotech no later than the 1970s, prior to Drexler's famous 1981 paper, though well after Feynman's speech "There's Plenty of Room at the Bottom". For a funder to be good at funding early-stage work they'd need an approach that could plausibly fund that kind of work. Few do.↩︎

  43. As far as we know, this idea has not been thoroughly explored. Indeed, even the basic thesis of a connection between risk and reward is on shaky ground: it's easy to think of examples of project ideas which are high risk, low reward. In financial markets, the connection between risk and reward is explained by portfolio theory, with the simple (and mostly empirically true, though there are exceptions) observation that if two portfolios have the same expected value, but one has higher risk, investors will usually prefer the lower-risk one, so driving down the price of the higher-risk one (and thereby raising its expected return). Thus the rule of thumb: high risk, high reward. But this reasoning fails in multiple ways in science funding: there's no obvious notion of expected value; no thick market of funders driving competition on price; indeed, the very notion of a buyer is rather murky. We can attempt to rescue the argument by thinking of a very different type of market in which scientists allocate time and attention to projects with different expected intellectual payoffs; following the same logic does yield a connection between risk and reward. But several steps in the resulting argument hold no more than weakly, and we expect the conclusion to hold no more than weakly. A better argument may be the (more historically contingent) fact that there is an over-supply of funding money going to low-risk projects, due to systemic facts about the way existing funders (and their ultimate sources of capital) operate. And so funders with a genuine appetite for risk – DARPA is the only obvious large candidate – benefit greatly by having a near monopoly. They are, in some sense, underpaying relative to results. These ideas need much further development. We are encouraged by recent work in the economics of science funding literature, e.g.: Chiara Franzoni, Paula Stephan, and Reinhilde Veugelers, Funding Risky Research (2021); Pierre Azoulay and Danielle Li, Scientific Grant Funding (2020); and Wesley H. Greenblatt, Suman K. Maity, Roger P. Levy, and Pierre Azoulay, "Does Grant Peer Review Penalize Scientific Risk Taking? Evidence from the NIH" (forthcoming).↩︎

  44. A stimulating talk on this subject is: Eric R. Weinstein, Sheldon Glashow Owes me a Dollar (and 17 years of interest): What happens in the marketplace of ideas when the endless frontier meets the efficient frontier? (2008), https://pirsa.org/08090036.↩︎

  45. The claim is ironic, even self-illustrating, since the argument and simulations in the paper don't reliably demonstrate the claim, except in some very loose sense. The reference is: John P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine (2005).↩︎

  46. A more skeptical, albeit much more speculative and less well worked out, view on the value is outlined here.↩︎

  47. Good starting places for understanding the rise of open science in the 17th and 18th centuries are: Paul David, The Historical Origins of 'Open Science': An Essay on Patronage, Reputation and Common Agency Contracting in the Scientific Revolution (2013), and: Marie Boas Hall, "Henry Oldenburg: Shaping the Royal Society", Oxford University Press (2002). An account of the modern open science revolution may be found in: Michael Nielsen, "Reinventing Discovery", Princeton University Press (2011).↩︎

  48. This idea has its origins in 1990s discussions from the Cypherpunks, and has received a lot of modern discussion. Nice early examples include: Shirley Wu, Envisioning the scientific community as One Big Lab (2008) and: Cameron Neylon, The Science Exchange (2008). More recent discussion tends to be associated with the DeSci (decentralized science) movement. See, for example: Juan Benet, DeSci: fund, organize, and open science (2022).↩︎

  49. James Phillips pointed out to us, and several people have subsequently confirmed, that while this is likely broadly true in the US, it is not always true in other countries. There, grant income may be a net loss to universities, since grants incur genuine overheads, albeit often not at the levels commonly charged in the US.↩︎

  50. We've leaned on the idea of a "design heuristic". Such heuristics are useful, but not fundamental. They're a bit like problem-solving heuristics in mathematics: ultimately, it doesn't matter how you solved such a problem, but only what you learned. The mathematician Ramanujan is said to have claimed that his work came from visions inspired by the goddess Mahalakshmi of Namakkal; fortunately for us, others can access the mathematical results, even if they lack those visions. Similarly, design heuristics are a means to an end. What is fundamental is the idea of imaginative design of new social processes, identifying and activating latent potential for discovery. So we're merely using design heuristics as scaffolding, to help illustrate the idea of imaginative design. In a nutshell: it ultimately doesn't matter if a great new design idea comes from a design heuristic, or was inspired by visions of the Goddess.↩︎

  51. Indeed, designers will happily juggle many different – even inconsistent – heuristic models to help generate imaginative new design ideas.↩︎

  52. As Herbert Simon pointed out in "The Sciences of the Artificial" (MIT Press, 3rd ed. 1996), there are also sciences studying artificial systems, and those tend to be more friendly to the design point of view. For instance, economists study the economy, but they also consider ideas like the design of new financial instruments. Still, we think it fair to say that even in most of Simon's "sciences of the artificial" the design point of view is taken to be subsidiary. Financiers tend to be much more imaginative than economists when it comes to new financial instruments, though of course the line is blurry.↩︎

  53. A pioneer in the design of institutions, and in thinking imaginatively about design in science, is Robin Hanson. See, for example, his webpage about Alternative Institutions, or his recent short essay about Intellectual Prestige Futures (2022).↩︎

  54. We've been to many metascience workshops. They're often surprisingly monocultural: mostly science funders, or mostly economists, or mostly non-scientist founders of new research organizations, or mostly active scientists. One resulting conviction: a large (but not dominant) subgroup at any such workshop should be active, experienced scientists, to act as a check on the wilder flights of fancy of non-scientists. More generally, with a proto-field it seems likely to us that extremely robust discussion between groups with very different points of view is likely to be generative, and a good check on the myopia of any individual group. We include ourselves in this diagnosis of myopia.↩︎

  55. Indeed, it would work best if there were a great diversity in ambient environments, and scientists could move liquidly between environments in which high risk was expected or in which less risky work was expected. We will return to this theme later in a much more general context.↩︎

  56. It's too early to be sure, but plausibly DeepMind ought to be viewed as an exception, with injections of rapidly-increasing quantities of capital coming in response to scientific success. It is, nonetheless, at most an approximate example, since the quantities of money injected were only approximately driven by scientific success: arguably the long-term commercial potential of AI is the principal driver, and that is perhaps only loosely related to the science. Still it's a very unusual and striking example.↩︎

  57. Historically, the structure has most often been attributed to Watson and Crick. However, as has subsequently become clear, Franklin's data played a crucial role and it seems at least reasonable to jointly attribute the structure. Indeed, complicating the matter further, an important part of that data was obtained by a student working with Franklin, Raymond Gosling.↩︎

  58. It is not difficult to find online forums full of people who will explain in loud detail how the scientific establishment routinely suppresses all daring, unconventional ideas etc. The very hold this cliched wisdom has over the imagination is, of course, an expression of the fundamental point: the scientist-rebel is one of the standard archetypes for the scientist hero, and our institutions reflect that (while also warring with tendencies to favor incumbents). It's curiously similar to the celebrated role of the rebel outsider in American culture: it would seem almost tautologically true that a rebel ceases to be a rebel when they become an icon. But from James Dean to Apple's "Think Different" to many modern examples, it has a tremendous (and largely healthy) hold over the imagination of both scientists and the public.↩︎

  59. Sharp-eyed readers will notice that this is not, in fact, a loop. But the core is of course an iterative loop, and we hope you shall forgive us this descriptive infelicity.↩︎

  60. Daniel S. Greenberg, Chance and Grants, The Lancet (1998).↩︎

  61. Indeed, the preferences of the grant agencies often propagate back to influence the behavior of grant-hungry institutions.↩︎

  62. We would be remiss not to mention Tal Yarkoni's beautiful essay No, it's not The Incentives—it's you (2018), which argues that very often people use "the system's incentives" as an excuse not to make change. But the essay is speaking to people's individual decision making and to their (both individual and collective) values. As a practical matter systemic incentives matter a great deal for collective behavior. They do not dictate behavior, as the essay emphasizes, but they do influence it, mediated by those values.↩︎

  63. A favorite: "Would you like to give a talk in our seminar series?" "Yes, sure, I'd love to talk about some work on […]". "Our seminar slot runs from noon to 3pm on Friday". "Oh, I only have material for a standard hour-long talk." "Perfect!" (The audience liked to get raucously involved in the talk.) This is a transcription of a story MN heard, from the principal; but he has also personally experienced something similar. On the most memorable such occasion, his "hour-long" talk eventually stretched to 6 hours over two days. This was certainly not due to his "running over", at least not in any standard way. Rather, it was due to the extraordinary intensity of involvement of the audience. Rarely has he spent so much time seated and silent during his "own" talk, while audience members took over at the whiteboard, trading ideas, arguing, developing lines of reasoning. One-on-one and small-group conversation is often very intense in good research organizations, but this kind of large-group seminar culture is unusual, in our experience.↩︎

  64. When we make this observation, people often respond with non sequiturs. Common responses are things like: "Well, of course, those places are where all the top people are", as though top people are somehow a fixed, immobile class, incapable of movement. Do you think that between 2000 and 2010 there was any migration of talent from Microsoft to the (non-existent in 2000) Facebook? In a dynamic environment, top people move a lot. And not just individually, but in aggregate.↩︎

  65. The material about Kariko is based on articles by: Damian Garde and Jonathan Saltzman, The story of mRNA: How a once-dismissed idea became a leading technology in the Covid vaccine race (2020), and: Gina Kolata, Kati Kariko helped shield the world from the coronavirus (2021).↩︎

  66. See: Ariana Remmel, How a historic funding boom might transform the US National Science Foundation (2021).↩︎

  67. Alexander Berger has made a striking reverse argument to us privately: that the scale of the NIH (and the doubling) created space on the edges for researchers such as Kariko to (just!) eke out a living. It's a very interesting argument, and would be worth considering as an argument for the status quo in a serious post mortem.↩︎

  68. For an account and further references, see: Douglas C. Prasher, Using GFP to see the light (1995).↩︎

  69. Aaron Gouveia, Shuttle driver reflects on Nobel snub (2008).↩︎

  70. Dan Charles, Glowing gene's discoverer left out of Nobel Prize (2008).↩︎

  71. Decca Aitkenhead, Peter Higgs: I wouldn't be productive enough for today's academic system (2013).↩︎

  72. Sydney Brenner, Nature's Gift to Science (2002).↩︎

  73. Gina Kolata, Grant System Leads Cancer Researchers to Play It Safe, The New York Times (2009).↩︎

  74. It is, in fact, easy to find significant examples in the early days of modern science. In the 17th through 19th centuries there were plenty of institutional green fields and social process white space. But over the 20th century and the early part of the 21st it has gradually become more difficult. In general, white space is far easier to impact than areas where existing institutions operate.↩︎

  75. As we shall see, this isn't because the problem is peculiar to social psychology; rather, it's because social psychologists were the first to do anything about it. The replication crisis has since "spread" to many other disciplines. We put spread in scare quotes because, of course, it has merely revealed a pre-existing underlying problem; what has spread is evidence and awareness of the problem, not the problem itself.↩︎

  76. The Open Science Collaboration, Estimating the reproducibility of psychological science, Science (2015).↩︎

  77. Benedict Carey, Many Psychology Findings Not as Strong as Claimed, Study Says (2015).↩︎

  78. Daniel T. Gilbert, Gary King, Stephen Pettigrew, and Timothy D. Wilson, Comment on "Estimating the reproducibility of psychological science", Science (2015).↩︎

  79. See, for instance: Ewen Callaway, Report finds massive fraud at Dutch universities, Nature (2011).↩︎

  80. Daryl J. Bem, Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect, Journal of Personality and Social Psychology (2011).↩︎

  81. Experiments often involve thousands of decisions. There is enough latitude that even honest, conscientious scientists can easily fool themselves. Tal Yarkoni put it well in his review (The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong, http://www.talyarkoni.org/blog/2011/01/10/the-psychology-of-parapsychology-or-why-good-researchers-publishing-good-articles-in-good-journals-can-still-get-it-totally-wrong/, 2011) of Bem's paper: "[T]his might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don't think that's true at all. In many ways, I think Bem's actually been relatively careful. The thing to remember is that this type of fudging isn't unusual; to the contrary, it's rampant – everyone does it. And that's because it's very difficult, and often outright impossible, to avoid."↩︎

  82. Stéphane Doyen, Olivier Klein, Cora-Lise Pichon, Axel Cleeremans, Behavioral Priming: It's All in the Mind, but Whose Mind?, PLoS One (2012).↩︎

  83. Daniel Kahneman, "Thinking, Fast and Slow" (2011).↩︎

  84. See Kahneman's open letter to colleagues, in 2012. It is also worth reading many followups to the events, including: Ulrich Schimmack, Moritz Heene, and Kamini Kesavan, Reconstruction of a Train Wreck: How Priming Research Went off the Rails (2017).↩︎

  85. The first occurrence we're aware of is in: Harold Pashler and Christine R. Harris, Is the Replicability Crisis Overblown? Three Arguments Examined, Perspectives on Psychological Science (2012).↩︎

  86. Paul E. Meehl, Why Summaries of Research on Psychological Theories are often Uninterpretable, Psychological Reports (1990, reprint of a 1985 article).↩︎

  87. Anthony G. Greenwald, Consequences of Prejudice Against the Null Hypothesis (1975).↩︎

  88. Registered reports were proposed in: Brian Nosek and Daniël Lakens, Registered reports: a method to increase the credibility of published results, Social Psychology (2014). Other solutions to these problems might also be proposed, of course. A common suggestion, which we've heard going back nearly 30 years, is that some journals should specialize in publishing null results. Unfortunately, it seems likely those journals would be considered low status, a place to publish unimportant work. That's not much of an incentive to publish! As we'll see, part of the beauty of Registered Reports is that they co-opt existing sources of prestige, in service of improved standards of evidence. However, they don't solve the entire issue: scientists are (understandably) most interested in surprising non-null findings, and we'd expect such findings to continue to garner more long-term attention. But Registered Reports do help us determine whether a non-null finding is true or not. This point is made in: Brian A. Nosek, Jeffrey R. Spies, and Matt Motyl, Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability, Perspectives on Psychological Science (2012).↩︎

  89. Brian A. Nosek, Tom E. Hardwicke, Hannah Moshontz et al, Replicability, Robustness, and Reproducibility in Psychological Science, Annual Review of Psychology (in press, 2021). Note that the fourth graph actually combines results from several different multi-site replication studies, and so we're being somewhat liberal in describing this graph as showing the results of "a study". Details may be found in the just-cited paper.↩︎

  90. John Protzko, Jon Krosnick, Leif D. Nelson et al, High Replicability of Newly-Discovered Social-behavioral Findings is Achievable (2020).↩︎

  91. One unfortunate consequence of the replication crisis is that it's led to lots of psychology-bashing from online critics. In fact, when a group of people realize they're making an error and begin taking steps to address it, that's laudable. We suspect many other sciences have the same problems, but haven't yet realized it or begun taking steps to solve the problem.↩︎

  92. Major disruptions in business are also often associated with crises. Netflix created a crisis for Blockbuster; Napster and successors such as Spotify created a crisis for the music labels. And so on. Such a crisis is often a signifier of a major improvement underway. Of course, the mechanisms are fundamentally different in the two cases. But there is also a surprising amount in common. Incidentally, in this connection: there is a literature on whether the replication crisis is "neoliberal". It's a rather curious literature, in some regards, but stimulating. See, for instance: Duygu Uygun, Mehmet Necip Tunç, and Ziya Batuhan Eper, Is Open Science Neoliberal? (2022), and citations within.↩︎

  93. It is perhaps not immediately obvious how deep an idea Registered Reports are. In fact, prior to Registered Reports many people had proposed variations on (and in some cases, made attempts to create) journals that would publish null results. This is a simpler idea that appears to address many of the same underlying issues. But such journals would tend to be very low status, and unlikely to attract high quality submissions. Registered Reports are a much better design that solves the underlying problems. This was pointed out in: Brian A. Nosek, Jeffrey R. Spies, and Matt Motyl, "Scientific Utopia: II", Perspectives on Psychological Science (2012). Of course, the idea isn't completely original – pre-registration of designs for medical trials had been pioneered much earlier. But the analysis in that paper is nonetheless a nice example of metascience design.↩︎

  94. It's striking that the builder of the first telescope is not remembered by most scientists, but Galileo is. The usual view is: Galileo made the scientific discoveries, but the toolbuilder did not. But they did enable discovery. This is an early example of a pattern that persists to this day. It's beyond the scope of this essay to delve deeper, but fascinating to think upon.↩︎

  95. An early draft proposal to the NIH may be found here.↩︎

  96. Indeed, one of us (MN) worked fulltime as a metascience entrepreneur from 2007 through 2012, developing and promoting open science.↩︎

  97. Some important changes in the social processes of science come about without much or any involvement of metascience entrepreneurs. For example, as far as we are aware, no-one at the NIH decided to increase the age at which grantees receive NIH awards; it seems more likely to be an emergent consequence of other decisions. (We owe this point to: Thomas R. Cech, "Fostering Innovation and Discovery in Biomedical Research", Journal of the American Medical Association (2005).) Still, many of the most important deliberate improvements in the social processes of science have come from the actions of one or a few individual metascience entrepreneurs. We further believe that there is tremendous scope to increase the impact of such metascience entrepreneurs. That's why we've focused on them.↩︎

  98. Note that as our examples suggest we do not think of metascience entrepreneurship as necessarily having a commercial component.↩︎

  99. Of course, he's also given to making such overconfident pronouncements!↩︎

  100. With that said, it is common on Twitter and in person to hear well-known tech VCs and CEOs critique academia, often in the face of events like the replication crisis. Critique is far easier than creation, and confidence is not always correlated with ability.↩︎

  101. A good parallel example of this is the rise in startup entrepreneurship, driven in part by the large body of writing about startup entrepreneurship, and the emergence of communities of practice.↩︎

  102. There really needs to be a history and ethnography of metascience entrepreneurship.↩︎

  103. See: Declan Butler, "Los Alamos loses physics archive as preprint pioneer heads east", Nature (2001).↩︎

  104. These patterns typically pass relatively quickly. For instance, fear of being scooped may pass as the arXiv actually becomes a means of claiming priority, and researchers may rush to arXiv publication to avoid being scooped. Also: many journals have initially feared the arXiv as a potential competitor, but typically come to accept (if only begrudgingly) the role it plays in scientific communication.↩︎

  105. In this, arXiv was very similar to social networks like Facebook, and some other venture capital-backed startup companies, which pursued an expansion strategy, starting narrow and gradually branching into adjacent communities. This is also a pattern that was noted in early work on the solution of collective action problems, by researchers such as Mancur Olson, "The Logic of Collective Action" (1971).↩︎

  106. Note that Braben publicly laid out the BRAVERI score in a paper in Physica A in 2002, after BP's Venture Research Unit had been disbanded. He claimed that most of what BP had funded would score 50-100% on the BRAVERI criteria. A later, smaller iteration of Venture Research has been run at University College London, and Braben has set out a revised version of the BRAVERI score in explaining the UCL program: Don Braben, "Introduction to the UCL Provost’s Venture Research Fellowship scheme" (2019).↩︎

  107. See, e.g.: Donald Braben, "Scientific Freedom: The Elixir of Civilization" (2008).↩︎

  108. One of us (MN), while not exactly hostile, was initially skeptical that replication was that important a problem in social psychology. He's gradually moved from skepticism to extreme enthusiasm for the replication movement, convinced by the strength of the work done on the replication crisis.↩︎

  109. It's worth noting that many of the arguments made against seem more like reasoning motivated by pecuniary interest than the interest of science. But there are some good faith arguments. For example, the science journalist John Bohannon did an investigation suggesting that many fee-charging open access journals had extremely low quality editorial standards. This was attacked by many open access advocates, but there was considerable merit to Bohannon's work: John Bohannon, "Who's Afraid of Peer Review?", Science (2013).↩︎

  110. Pierre Azoulay, Turn the scientific method on ourselves, Nature (2012).↩︎

  111. See, for a review of the early work of the Research on Research Institute: Sandra Bendiscioli, Teo Firpo, Albert Bravo-Biosca, Eszter Czibor, Michele Garfinkel, Tom Stafford, James Wilsdon, and Helen Buckley Woods, The experimental research funder's handbook (2022).↩︎

  112. See the announcement here.↩︎

  113. See the proposal here.↩︎

  114. James A. Evans and Jacob G. Foster, Metaknowledge, Science (2011).↩︎

  115. James McKeen Cattell, "American Men of Science" (1910).↩︎

  116. See: Robert May, "The Scientific Wealth of Nations", Science (1997), and: David A. King, The Scientific Impact of Nations, Nature (2004).↩︎

  117. Merely listing pertinent recent review papers and books would be an extensive undertaking. Good starting points for obtaining familiarity include the recent book: Dashun Wang and Albert-László Barabási, "The Science of Science" (2021); and the review paper: Santo Fortunato, Carl T. Bergstrom, Katy Börner, James A. Evans, Dirk Helbing, Staša Milojević, Alexander M. Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, Alessandro Vespignani, Ludo Waltman, Dashun Wang, and Albert-László Barabási, "Science of Science", Science (2018).↩︎

  118. Lingfei Wu, Dashun Wang, and James A. Evans, "Large teams develop and small teams disrupt science and technology", Nature (2019).↩︎

  119. Benjamin F. Jones, "Age and Great Invention", The Review of Economics and Statistics (2010).↩︎

  120. Caroline S. Wagner and Jeffrey Alexander, "Evaluating transformative research programmes: A case study of the NSF Small Grants for Exploratory Research programme", Research Evaluation (2013).↩︎

  121. One part of the problem is that it's tempting to concentrate our attention on papers which tell the story we want, rather than to change our beliefs about the world in the light of compelling new evidence. Few of us think we do this; almost all of us do. It's difficult to escape because how we evaluate evidence is mediated by a community, which determines much about what we see, and how we see it. So even if our individual reasoning is unusually careful and honest, we are still strongly influenced by our community's standards. Put another way, if even one of our colleagues – perhaps a person we've never met, in some far distant part of our network – is a little less fair-minded than they ought to be, then they actually negatively affect our ability to evaluate evidence. In more detail: how we evaluate evidence is not an individual property, it's a collective property. Consider the following simple model: suppose there are two strong papers on a policy topic you're interested in. One paper, X, reaches a conclusion you like, while paper Y reaches the opposite conclusion. Suppose paper Y actually has stronger evidence. You, being a serious- and fair-minded person, will realize if you deeply read both papers that paper Y has better evidence, and re-evaluate your opinions. This all seems fine and good. But there's a problem. Your friends and colleagues, who likely share your policy opinions, are naturally excited about paper X. They pass it around; you're likely to see it. Meanwhile, if they come across paper Y they're just not so excited, and a little less likely to pass it on, making you much less likely to ever see paper Y. Collectively, paper X rapidly passes through much of your network, and is absorbed as another piece of evidence in favor of that position; paper Y is heard of only by a few exceptionally wonkish people. 
(In this sense how we absorb knowledge is a bit like how we catch an infectious disease: whether you catch a disease depends in part on your actions, but also to a considerable extent upon the behavior of your broader social network.) This is a simple model, and of course there are many ways it can be modified so the conclusions don't hold. But it points at a real problem: in many respects we don't reason individually, we reason collectively. What we pay attention to, and the kind of attention we pay, is mediated by our networks. Worse, it is easy to forget this and to arrive at an illusion of fair reasoning because we ourselves are exceptionally fair. But it's not enough!↩︎

  122. See, for example, this article. As far as we know, Pritchett hasn't used this term in his published work, although he certainly makes arguments germane to it, e.g., in: Lant Pritchett, "It pays to be ignorant: A simple political economy of rigorous program evaluation", The Journal of Policy Reform (2002).↩︎

  123. The alternative is for metascientific work to become principally the purview of science bureaucrats who think they're making major improvements, when really they're making tiny changes or making things worse. Indeed, we believe many of the easiest improvements will be things in the blind spot of existing organizations, often things which make them uncomfortable. Things like: taking irrelevantly long grant applications and making them quick and easy. Speeding up grant cycles. Doing things which are illegible or a little disreputable or politically unpopular. And so on. In all this, we're firmly on the side of the outsiders.↩︎

  124. This point is made in: José Luis Ricón, “New Science's NIH report: highlights”, Nintil (2022-04-25), available at https://nintil.com/new-science-nih/.↩︎

  125. Note that this is different from the standard meaning for chronoscope, which is a device to measure small time intervals.↩︎

  126. Michael Nielsen and Kanjun Qiu, "The trouble in comparing different approaches to science funding", https://scienceplusplus.org/trouble_with_rcts/index.html, San Francisco (2022).↩︎

  127. In particular: (1) the non-stationary nature of metascience: lessons learned in the past may not continue to hold in the future; and (2) the need to learn from single outlier examples. Also related is whether we can expect metascience to be a science at all: why should we expect there to be underlying laws or principles at all? This is discussed in: Michael Nielsen, "In what sense is the science of science a science?", https://scienceplusplus.org/what_is_sos/index.html, San Francisco (2022).↩︎

  128. When we discuss this with colleagues, they sometimes object that the bulk of the curve also matters, and should not be ignored. We wish to be clear: we are not arguing for ignoring the bulk, but merely for taking seriously both the bulk and the outliers. An interesting, albeit much stronger point of view – one whose truth we are very uncertain of – is that if we prioritize outliers, then everything else will take care of itself. This is not a popular view with many scientists! But it may have the decided upside of being true. Note that it is strongly related to the ongoing argument over the Ortega and Newton hypotheses, cf.: Jonathan R. Cole and Stephen Cole, "The Ortega Hypothesis", Science (1972).↩︎

  129. We don't have detailed long run sources of data for the Bell Labs research budget. But three sources are at least suggestive. One is the Office of Technology Assessment's report "Information Technology R&D: Critical Trends and Issues" (1985). This estimates a $2.1 billion spend on R&D in 1982, with roughly 10% of the R&D budget on research, so $210 million in 1982, or $660 million in 2022. A second source is this post: Brian Wang, "Comparing Research budgets of 1970s Bell Labs to DARPA and Google Today", https://www.nextbigfuture.com/2015/08/comparing-research-budgets-of-1970s.html (2015). It estimates $500 million in non-military R&D in 1974, or $3.1B in 2022. Of course, that's an over-estimate since it includes all R&D, not just research. Furthermore, many sources give an impression of an organization whose peak years were the 1940s through 1990s. Perhaps most notable is: Jon Gertner, "The Idea Factory" (2012).↩︎

  130. Another example to illustrate the point, albeit less directly pertinent, comes from venture capital (VC). This is outlier dominated – see, for example: Abraham Othman, Startup Growth and Venture Returns (2019). Again, scale may help in VC: simply having the resources to make a much larger number of relatively independent bets may give one VC an enormous advantage over another, even if they are actually using an otherwise inferior strategy. The additional scale gives them a much higher chance of achieving catastrophic success, despite their strategy.↩︎

  131. Eugene Garfield, "Is citation analysis a legitimate evaluation tool?", Scientometrics (1979).↩︎

  132. The analogous argument for startup companies has been eloquently made in: Paul Graham, Do Things That Don't Scale (2013).↩︎

  133. Abraham Pais, "Subtle is the Lord: The Science and the Life of Albert Einstein" (1982).↩︎

  134. Note that this kind of approach certainly does not mean uncritically accepting scientists' myths about themselves. But there is benefit in understanding science on its own terms, as well as from a more outsider perspective.↩︎

  135. One idea we toyed with while writing this essay was that of a tenure fellowship, meaning tenure but with the ability to move immediately and with no administrative friction between different universities. Making such an idea work is complicated by some logistical facts (moving a lab may be difficult), as well as a market-making problem: some universities are more attractive locations than others; some professors are more attractive as members of faculty than others. Still, we believe there may be some version of the idea which is workable, and which adds liquidity to the usually very illiquid market in tenured faculty.↩︎

  136. In Part 2 we've been focused on scaling social processes that are working. But there's some tension with the desire for structural diversity. Perhaps, if outlier successes are what matter, then it's sufficient to have many small but very different cultures. In this view, scaling may not matter so much. Indeed, maybe scale is even undesirable. Cf. the problems discussed in this Appendix.↩︎

  137. See this Twitter commentary by Stuart Buck, and the associated essay: Stuart Buck, "'Evidence-based' Philanthropy Gone Wrong: The Myth of How Small Schools Failed", https://www.insidephilanthropy.com/home/2014/12/22/evidence-based-philanthropy-gone-wrong-the-myth-of-how-small.html (2014).↩︎

  138. We don't know Langer or Church, and they may disagree. Certainly, both have encountered significant problems in getting their own work supported. But while those problems are no doubt very real, they appear to pale beside the issues faced by many others, including people such as Katalin Kariko, Stephen Wiesner, and Douglas Prasher, mentioned later in the paragraph as contrasting examples of non-canonical scientists.↩︎

  139. A stimulating simple model suggesting some of the limits of "the free play of free intellects" idea may be found in: Erich Kummerfeld and Kevin J. S. Zollman, "Conservatism and the Scientific State of Nature", The British Journal for the Philosophy of Science (2016). Their argument is rather different to ours: it's an idealized microscopic model of how people choose what to work on, arguing that individuals will tend to under-invest in risky projects, if left to their own devices. Still, the model is complementary, in the sense that these are precisely the kinds of issues that can be addressed by culture and institutions.↩︎

  140. Put another way: the value of unique environments is that they produce particularly unusual individuals, individuals capable of actions no-one else is capable of. Such an environment does not feel like anywhere else. There is obviously some connection between the uniqueness of such an environment (and the concomitant emotional experience) and the network structure and social processes. Those network structures must be genuinely different in some important way. But understanding such connections is far beyond the scope of this essay. We will, however, note again that there is some tension between scale and uniqueness.↩︎

  141. Insightful early explorations of this idea may be found in: Irving John Good, Speculations Concerning the First Ultraintelligent Machine (1965); and: Vernor Vinge, The Coming Technological Singularity: How to Survive in the Post-Human Era (1993).↩︎

  142. The benefit to humanity depends in part on the future course of those countries. China's totalitarian approach is deeply discouraging; we are encouraged by the thought that totalitarian governance is likely net bad for their science and technology.↩︎

  143. Recent surveys in this direction include: Andy Matuschak and Michael Nielsen, How can we develop transformative tools for thought? (2019); and: Bret Victor, Media for Thinking the Unthinkable (2013). See also: Douglas Engelbart, Augmenting Human Intellect: A Conceptual Framework (1962). These omit discussions of brain-computer interfaces, but that is of course also of interest in this vein.↩︎

  144. The code may be found here.↩︎

  145. In the world of technology startups it's like asking a startup founder if their idea is better than the competitors'. Of course, the answer must always be "yes!" But it doesn't make it so.↩︎

  146. At the level of individual researchers and their networks this effect has been studied in: Pierre Azoulay, Joshua S. Graff Zivin, and Jialan Wang, "Superstar Extinction", The Quarterly Journal of Economics (2010).↩︎
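
The simple model in note 121 (paper X is passed on readily, paper Y a little less so) can be made concrete with a small simulation. What follows is a minimal sketch under assumptions of our own, not anything specified in the note: a hypothetical network of 500 people, each with 6 randomly chosen contacts, and illustrative forwarding probabilities of 0.25 for paper X and 0.15 for paper Y. The function names are likewise hypothetical. Each person who sees a paper gets one chance to forward it to each contact, which is a standard independent-cascade style of model.

```python
import random
from collections import deque


def build_network(n_people, n_contacts, rng):
    """Give each person a fixed number of randomly chosen contacts (assumed structure)."""
    people = range(n_people)
    return [rng.sample([q for q in people if q != p], n_contacts) for p in people]


def reach(network, share_prob, rng):
    """Fraction of the network that ever sees a paper, starting from one random reader."""
    start = rng.randrange(len(network))
    seen = {start}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        # Each reader gets one chance to forward the paper to each contact.
        for contact in network[person]:
            if contact not in seen and rng.random() < share_prob:
                seen.add(contact)
                queue.append(contact)
    return len(seen) / len(network)


def mean_reach(share_prob, n_people=500, n_contacts=6, trials=200, seed=0):
    """Average reach over many independent releases of a paper into the same network."""
    rng = random.Random(seed)
    network = build_network(n_people, n_contacts, rng)
    return sum(reach(network, share_prob, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    # Illustrative sharing probabilities: X is liked, Y is a little less exciting.
    print("paper X mean reach:", mean_reach(0.25))
    print("paper Y mean reach:", mean_reach(0.15))
```

With these assumed numbers, each reader of paper X forwards it to 1.5 contacts on average, while each reader of paper Y forwards it to 0.9. That puts the two papers on opposite sides of the threshold at which a cascade sustains itself, which is one way a modest difference in enthusiasm could turn into a large difference in who ever sees the paper. Different parameter choices will of course give different results.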