In what sense is the science of science a science?

By Michael Nielsen (Released Jan 13 2022)

An attempt to articulate a problem I'm having in understanding how generalization occurs in the science of science. Very roughly speaking: it's about the idea that the science of science will necessarily involve converting large parts of epistemology from philosophy into a science. Rapidly written thinking-out-loud – corrections, further thoughts, and pointers to related work welcome (please leave comments at the bottom).

A nexus of disciplines has emerged over the past couple of decades, going variously by the names science of science, metascience, and meta-research. It's an emerging area, overlapping strongly with many older disciplines1, and different people define it somewhat differently. One common theme and description is to turn the scientific method on ourselves2.

I got curious about the term: the "science of science"? Certainly, there are many interesting empirical questions that can be asked about the practice of science. Techniques from disciplines like sociology and economics can be and are being used to understand things like group dynamics in science, behavior in response to incentives, and so on. In that sense, the term "science of science" makes a great deal of sense. But I also had a niggling sense that the science of science might run into some pretty strange, perhaps even unprecedented, epistemological questions, when considered as a science. The situation reminded me a little of Goedel's (and, to some extent, Turing's) work, developing a way of reasoning mathematically about the foundations of mathematics. And I also had an instinct of some connection to Hume's problem of induction. This informal essay attempts to articulate those questions.

A relatively recent review of the science of science is by Fortunato et al in Science in 20183. The paper contains many striking graphs (mostly drawn from earlier papers). As a typical example, here's a fascinating graph showing when the most highly cited paper occurs in a scientist's full sequence of papers4 (from a sample of 10,000 scientists).

The graph at first looks boring – nothing changes! – but it's the very regularity which is interesting: the most highly cited paper appears equally likely to occur at any point in a scientist's publication record! And this perhaps contradicts some folk ideas: about the energy of youth; or the wisdom and advantages of age; or the perceived benefits and drawbacks of different career stages.
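In fact, this flat shape is exactly what a simple null model predicts: if each paper's citation count were an independent random draw, the top paper would be equally likely to land anywhere in the sequence. Here's a minimal simulation of that null model (the uniform draws and the 50-paper career length are my own illustrative assumptions, not taken from the underlying study):

```python
import random

def top_paper_position(n_papers):
    """Index of the highest-cited paper in a simulated career where
    each paper's citation count is an independent random draw."""
    citations = [random.random() for _ in range(n_papers)]
    return citations.index(max(citations))

random.seed(0)
n_careers, n_papers = 10_000, 50
counts = [0] * n_papers
for _ in range(n_careers):
    counts[top_paper_position(n_papers)] += 1

# Under this null model each of the 50 positions should hold roughly
# 10_000 / 50 = 200 careers, i.e. an essentially flat histogram.
```

Plotting `counts` gives an essentially flat histogram, matching the regularity in the graph above – which is part of what makes the empirical result so intriguing: real careers, with all their structure, behave like this memoryless null model.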

What can we learn from a graph like that shown above? Are there any generalizable lessons? What, in short, does the graph mean?

Perhaps the most obvious naive response would be to accord it something like a "law of the science of science" status: a scientist is equally likely to write their most highly cited paper at any point in their sequence of publications.

Obviously any such "law" would seem pretty fragile. It's not difficult to think up scenarios that would (plausibly) cause such a "law" to break down.

For instance, consider a society where scientists received much poorer medical care than most scientists in our society are accustomed to. It is at least plausible that in such a society there might be a gradual decline in the graph above, as people's capacity to do scientific work declined with their health. I'm not saying I can prove this, just that if someone were trying to claim "law" status then the onus would be on them to have an explanation as to why this kind of change didn't occur.

To put it another way, the shape of that graph is plausibly influenced by exogenous factors like health, health care, and many others. If you're tempted to treat the even regularity of the graph as some kind of general law, you'd need a pretty good argument for why such exogenous factors wouldn't influence the shape of the graph5.

But it's not just exogenous factors that would plausibly influence the shape of the graph. There are also factors endogenous to science that plausibly influence the shape. Some of those, of course, are social (and amenable to study using standard techniques). But – and this is what I am particularly curious about – it also seems as though the shape of the graph plausibly reflects something about the structure of what remains to be discovered. That is, the shape of the graph plausibly reflects something about what we might call, for lack of a better term, structural epistemics.

Suppose, for instance, that a scientist begins their career working in a hot field. They obtain many great results, and write many important papers. And then the field dries up. This happens sometimes – the main problems of the field may be solved, or a new technique or approach may make the field obsolete. If this happens it may be difficult for the researcher to switch to a more productive field – after all, they have no track record in any new field they'd like to switch to, and so any grant applications are likely to be turned down. They might end up routinely publishing much less well-cited papers. This kind of effect would contribute to the graph bending down; presumably, there must be other counterbalancing effects causing it to bend up.

The reason I mention this is because it involves in a curious way a fact about the structure of undiscovered knowledge: the fact that the field dried up. This is rather striking. It means that any generalization about the shape of that curve necessarily reflects something (however mild, and however many other factors might be involved) about the structure of humanity's undiscovered knowledge. And so insofar as you might hope to claim that curve as a "law" of the science of science, you're therefore making (in part) an assertion about the structure of undiscovered knowledge.

I realize this isn't a very strong argument: it's a weak connection I've drawn, and even that weak connection is only meant to be illustrative, not ironclad. For this specific graph it's possible no structural epistemics are involved. But this same pattern recurs, over and over again: many things you might plausibly hope for as generalizable results in the science of science depend, in some way, on the structure of humanity's ignorance. It seems to me very striking that the science of science involves, in some sense, reasoning about that ignorance.

Of course, scientists routinely reason about humanity's ignorance in a very heuristic way. They have taste in what problems they explore (and ignore); about what directions they find promising; about what directions they think have dried up; and so on.

Think of Dirac: "A theory with mathematical beauty is more likely to be correct than an ugly one that fits some experimental data." Or: "If one is working from the point of view of getting beauty into one's equation, … one is on a sure line of progress."

Or consider the way Freeman Dyson chose what to work on (also, coincidentally, involving Dirac in a cameo role). According to James Gleick6, while Dyson was an undergraduate: "One day an assistant of Dirac's told Dyson, 'I am leaving physics for mathematics; I find physics messy, unrigorous, elusive.' Dyson replied, 'I am leaving mathematics for physics for exactly the same reasons.'"

It's easy to find a thousand examples of taste like this: heuristic decisions made using some combination of intuition, study of history and philosophy of science, personal ability, personal aesthetics, and so on. But although these decisions are made carefully, they're not being made using any kind of scientific theory. We don't have a scientific theory that lets us reason about what science doesn't yet know.

This line of thinking began with one prosaic graph, and asking the question: what do we learn from it? What does it mean? And the more I considered, the more it seemed any meaning for the graph must be connected to the structure of humanity's current ignorance; not what we know, but rather what we don't know, and what we might plausibly come to discover. Those aren't things we control; nor are they things (by definition) we can anticipate in advance, though perhaps we might have some hints7.

Comparing funding schemes

Okay, that's one example. But there's another example which is more important, and perhaps illustrates the issue more clearly.

Suppose you decide you want to compare two funding schemes against one another. Say, a lottery versus a more conventional panel-based approach. You convince a big funder to do a careful comparison of the two, and lotteries do better.

Never mind for now that it's hard to say when one is "better" than the other. We'll come back to that in a bit. For now, let's just suppose you've somehow made a good comparison, and you're convinced: lotteries did better than panel-based funding.

What should we conclude from this?

Obviously, this doesn't mean you should generalize to "lotteries are better". But suppose you did another comparison study, and lotteries were better again. And another comparison, and lotteries were better again.

What can you conclude now?

Well, maybe all those comparisons were done in biology. And then someone does one in chemistry, and oops, nope, turns out panel-based does better. Then someone does another comparison in chemistry, and panel-based again does better.

Should we now conclude that lotteries are better in biology, and panel-based approaches are better in chemistry?

Oops, no, now someone does another comparison, and it seems that lotteries were only better in some parts of biology. They work better than panels when studying large-scale systems, zoology, and so on. But for molecular biology, panel-based works better.

So should we now conclude that it's really studying things at the molecular level which matters (use panels), versus studying larger systems (use lotteries)?

Oops, wrong again, someone does the comparison again ten years later, and it turns out the molecular level now benefits from lotteries, and large systems from panels.

This might seem an unlikely line of thought experiments. But many people, including me, are interested in more comparisons like this being done. Such comparisons are only going to be of much use if we know something about how to generalize. You don't (directly) learn much at all from any single comparison. To learn anything, you need to work toward a deeper explanation of the causal factors that might make one approach better than the other, and under what circumstances.

Another way of looking at this is that it's really about the question: how can we generalize these results so we start to have an explanatory theory of how best to fund science?

To get a better grip on this question it's helpful to think about similar situations elsewhere.

Suppose we're trying to test how changes to minimum wage affect unemployment. Perhaps you do some careful studies in Kentucky, and conclude that a $5 per hour hike in the minimum wage increases unemployment by 1%.

What do you really learn from this? What, if anything, does it mean in California (say)? What, if anything, does this mean for Kentucky in 10 years' time?

To answer those questions at all convincingly, your study can play only a small part. It must be embedded in a much broader understanding of the causal factors relating wages and unemployment.

So maybe you say "Oh, the study was done during a pandemic, people are much more desperate to cling onto work than is usual, so it probably doesn't tell us anything about Kentucky in 10 years' time."

Or you might say "Well, cost of living in California is [such-and-such], so let's adjust in [so-and-so a way]. And the median wage is [such-and-such], and so we should adjust in [so-and-so a way]". And maybe make some other adjustments as well, based on reasonable plausible models.

Now, obviously this kind of approach has many challenges.

Most obviously: it can easily fall victim to motivated reasoning, especially in cases (like this one) where a lot of people bring a lot of emotion and strong prior convictions.

But if we work at it hard enough, and keep generating deeper and deeper explanations, and testing them, and cross-checking them, then human beings have often done a pretty good job at figuring out how to generalize this kind of result. Not, I stress, from single point comparisons. But rather from large webs of experiments, built as we conjure up, test (and frequently amend or discard or improve) different lines of explanation.

Now, can we do this with the funding example?

In principle, there seems to be no reason why not. We're a long, long way from being able to do so, in my opinion8 – at the moment we have folk theories and just-so stories, not a powerful explanatory framework. But in principle it seems likely to me that we can do a lot better.

There are at least two major caveats, however.

The first caveat is a tangent from the remainder of this essay. But it's an interesting and under-discussed tangent, so I'll describe it. It's this: in the economics example, you're interested in aggregate effects, and typical behavior. You want to understand statistics like the behavior of the median and the average, and of different cohorts.

By contrast, in science it's plausible that you may be more interested in the extreme outliers, enormous breakthrough discoveries. Indeed, you may be willing to tolerate the mean or median (or even 95th or 99th percentile) outcomes getting much worse if that results in bigger outliers.

Some people would argue this is wrong, and that outliers are overrated in science. But I think it's plausible. In this sense, science is more like airline safety or venture capital than it is like the economics example. Science, airline safety, and venture capital all share two features: (1) outlier events are plausibly the main things you want to optimize for; and (2) the outliers tend to have non-recurring features, because when they occur, the system changes in response (the resulting challenge: how to avoid overfitting to "Mark Zuckerberg does [such-and-such]"; or "Bell Labs did [so-and-so]").
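As a toy illustration of that trade-off, here's a comparison of two hypothetical funding schemes with lognormal grant payoffs – one better on the typical grant, the other with a fatter tail. All the parameters and the lognormal assumption are invented purely for illustration:

```python
import random

random.seed(0)
N = 1_000  # grants per scheme

def payoffs(mu, sigma):
    """Simulated payoffs for N grants, lognormally distributed."""
    return [random.lognormvariate(mu, sigma) for _ in range(N)]

scheme_a = payoffs(mu=2.0, sigma=0.5)  # better typical outcome
scheme_b = payoffs(mu=0.0, sigma=1.5)  # worse typically, fatter tail

mean_a, mean_b = sum(scheme_a) / N, sum(scheme_b) / N
best_a, best_b = max(scheme_a), max(scheme_b)

# Scheme A wins on the average grant, but scheme B's fat tail makes
# it the likelier source of the single biggest outcome -- the outlier
# a breakthrough-focused funder may care about most.
```

A funder judging by the mean would pick scheme A; a funder optimizing for the biggest single breakthrough would plausibly pick scheme B. The right comparison depends on which objective you hold, which is exactly why "which scheme is better?" is underdetermined.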

This is a difficult situation to reason about.

I should say, by the way: it's also difficult to know whether it's truly the outliers which dominate in science. Suppose you take citations to a paper as a quantification of how valuable a discovery is (as dubious as that measure is). Analyses suggest9 that the probability distribution of citations C behaves something like 1/C^3.16. This kind of distribution requires considerable care to analyze – certainly, intuitions built up from studying the normal distribution will not be entirely reliable – but it is not actually dominated by the outliers. So there's a case to be made against the outliers-dominate folk theory.
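To see why, here's a quick numerical check under a toy model. A density falling off like 1/C^3.16 has a survival function falling like 1/C^2.16, i.e. a Pareto distribution with shape 2.16; sampling from it (the minimum citation count of 1 and the sample size are arbitrary choices for illustration) shows the top 1% of papers capture only a modest share of the total:

```python
import random

# p(C) ~ C^(-3.16) implies P(C > c) ~ c^(-2.16): a Pareto
# distribution with shape parameter 2.16 (minimum value 1 here,
# chosen arbitrarily for illustration).
random.seed(0)
shape = 3.16 - 1.0
samples = sorted((random.paretovariate(shape) for _ in range(100_000)),
                 reverse=True)

top_share = sum(samples[:1_000]) / sum(samples)
# The top 1% of "papers" account for only on the order of 8% of all
# "citations": heavy-tailed, but not dominated by the outliers.
```

Contrast a tail exponent below 2, where the mean diverges and a handful of outliers really would swamp everything else; at 3.16 the distribution is heavy-tailed but well-behaved.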

On the other hand, scientific discoveries are not fungible; indeed, arguably it is a mistake to quantify their impact at all. Alan Kay has pointed out10 that you can't pile up a bunch of smart people to produce one genius. Insofar as this is correct it points more toward the outliers-dominate theory. I won't come to any conclusion here, I just want to emphasize that there is a genuine issue. I expect to come back to this non-recurring outlier problem in the near future, in collaborative work with Kanjun Qiu.

As interesting as this all is, we've diverged from the main line of this essay. Recall: we're talking about how to generalize from comparisons of two funding models, say, lotteries and panels. And we're pondering what to learn from an analogous situation in economics, where experiments have been done to determine the relationship between minimum wage and unemployment.

In the economics example, to make progress you need to know a lot apart from your experimental results – you (plausibly) need ideas like cost-of-living, median wage, and many others, as well as reasonable models of how they relate to the experiment. The stronger this surrounding web of understanding, the more sense we can make of the experiments. If the surrounding web of understanding is weak, the experiments mean very little; if the surrounding web of understanding is very strong, the experiments may mean a lot.

Suppose you're trying to do something similar in the funding example. You might end up making an argument like: "well, in studying molecules we used to be in sore need mainly of data [you'd need to state evidence for this], and so a panel-based system was good, since experts could do a great job evaluating who'd be best at that [need evidence]. But now we have lots of data [more evidence], and weak ideas about how to understand it [you get the idea], and what's needed is lots of wild exploration, and lotteries help in getting that wild exploration."

Forget the details of the argument or even whether it's true (I made it up after 10 seconds of thought). Rather, just consider the structure. It's in part a sociological argument, about the types of decisions groups tend to make. That kind of thing has been extensively studied, and there's no reason it couldn't be studied here, using similar techniques. But it's also an argument about what types of exploration are best in response to humanity's current state of knowledge. It is, in some sense, a theory of discovery, and an assertion about the adjacent possible.

What makes my spidey sense tingle is that the objects in any such theory are (in part) a hypothetical space of possible discoveries, of possible explanations of the world. I called it a theory of discovery just above, but it might equally well be called a theory of the unknown, or theory of exploration, or theory of theories11. Of course, some of the objects of any such theory would also be amenable to more standard descriptions: things like exploration strategies, or group dynamics. But some would be a lot stranger: currently unknown types of explanation, currently unknown types of theoretical entity.

I've had a lot of trouble articulating what bothers me, and the above succeeds at most partially. Still, the situation in the science of science seems mostly unprecedented to me – and, for that reason, very exciting. As mentioned in the opening, the most similar thing is perhaps what Goedel, Turing et al did in the foundations of mathematics, and perhaps there are things to learn there. And there are also some similarities to the philosophy of science, although I believe it's mostly engaged at a different level of abstraction. I think it's plausible the science of science is, in part, attempting to convert epistemology from a part of philosophy into a science.

With all that said: (a) yes, the situation is very unusual; but (b) the practical thing is not to worry about it12, and just make progress where we can. Certainly, there are plenty of practical ways of making progress. As in the early days of any proto-science, in some areas there is little firm ground to stand on – finding that ground is part of the challenge! And, of course, many aspects of the science of science – study of the social dynamics, decision-making, and of the actual history – are on firm ground, and under intense and fruitful scrutiny.



Thanks to Anastasia Gamick for a conversation that pushed me to attempt to articulate these thoughts more clearly. Thanks also to David Chapman, Jed McCaleb, Adam Marblestone, and Ben Reinhardt for earlier conversations. And thanks to Kanjun Qiu both for conversations about the present topic, and many conversations about the science of science more broadly.

Citation information

For attribution in academic contexts, please cite this work as:

Michael Nielsen, "In what sense is the science of science a science?", San Francisco (2022).


  1. It of course has large overlaps with disciplines including the history and philosophy of science, the sociology of science, scientometrics, science and technology studies, and others.

  2. Pierre Azoulay, "Turn the scientific method on ourselves", Nature (2014).

  3. Santo Fortunato et al, "Science of Science", Science (2018).

  4. R. Sinatra, D. Wang, P. Deville, C. Song, and A.-L. Barabási, "Quantifying the evolution of individual scientific impact", Science (2016).

  5. Perhaps, in an act of collective unconscious that would make Jung proud, we have arranged our institutions to ensure career paths where people are equally likely to have a big discovery at any point during their career. This is not impossible, though perhaps not a priori likely. It would, in this case, be a purely collective psychological and organizational phenomenon. Many other explanations might also be concocted. But that is to digress from our main line.

  6. James Gleick, "Genius: The Life and Science of Richard Feynman" (1992).

  7. Diehard statistical modelers might say "oh, you should use probability theory, it's for reasoning in the presence of uncertainty". Maybe you can make progress reasoning about probability distributions over conceptions of reality in the adjacent possible. But to me that sounds like too big an ask.

  8. I suspect we're a long, long way from being able to do so in a lot of economic contexts, too.

  9. This is very approximate, and there are many important caveats. See, e.g., the discussion in: Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman, "Power-law distributions in empirical data", SIAM Review (2009).

  10. I cannot find the reference, and it's possible I'm misquoting. Attribution or correction most welcome.

  11. Of course, funding operates at a large scale. So, in some sense, what is required is not an individual theory of discovery. Rather, it's more a macroscopic theory of discovery-in-the-aggregate, a thermodynamics of discovery. It takes what was formerly in the domain of individual taste and intuition, and attempts to study it scientifically.

  12. I'm reminded of David Mermin's comment on quantum mechanics: it's a weird theory which human beings don't really understand, but it's a mistake to spend too much time pondering that weirdness. Rather, 6 days a week you should just go ahead and use quantum mechanics to practically improve our understanding of the world; on the 7th day you can permit yourself to ponder the strangeness. So this is a kind of "day 7" essay, attempting to articulate something about the big picture.