I am back in an English class, for the first time since 1979. Signum University is running a class called “Research Methods”. I signed up because I’m old. Two years ago I discovered that, although I was a state-of-the-art statistician in 1982, the things I know don’t count as knowing statistics any more. The same thing may have happened here. And so it appears. Half the syllabus sounds like the first month or two of this blog. (Good – I’m not doing it wrong!) The other half is things I’ve never even thought of. (Better!)
One of the books they’re making us read is called The Craft of Research. I like the word “craft” there. Research is not a science, and it would be pretentious to call it an art. It’s something in between. It’s an excellent book in almost all ways. My reactions to it alternated among “obviously – what else would one do?”; “have you been looking over my shoulder?”; and “wait – I thought I invented that!” But there’s one point with which I must take issue.
Chapter 3 is an orc’s breakfast. Their guidance for doing research that doesn’t make people ask, “So what?” is to think on three levels:
- I am studying x,
- Because I want to find out y(x),
- Which will help the reader understand Important Thing z, of which y is an element.
They talk as if you do research by starting with your source of data. I would have had no objection to this formulation in the 20th century. Now, though, this is the canonical drunk looking for his keys under the lamppost. In the age of Cheap Data it has become a trap.
Most people who like to talk about the leading edge of technical progress say “big data”, and justify its importance by telling stories of Google searches and flu outbreaks. But when you ask them the most basic question, “How big is it?”, you find that they aren’t all talking about the same thing. There’s one definition I actually like: “Big data is big enough that it won’t fit on a single machine — which means you need to use specialized tools to muck with it.” Readers of this blog know how much I like Wikipedia, but in this case it let me down: “Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them.” (The article then goes on to list the same problems everybody has ever had collecting measurements of any kind.)

People who sell storage and processing power like to brag that whatever you’re thinking of won’t challenge their machines. I have a certain affection for the smartass response: “If you have to ask this question, your amount of data isn’t that big 🙂 …”. But there’s no way to argue that the term is well defined. That’s why I say “cheap data” instead. That’s what it really is. Anyone who’s ever assembled a large set of measurements by hand knows exactly what I mean.
The world is now full of databases. I work with dozens of people who build and maintain them. For them, Step 1 is a given. They’re studying their database because that’s what they do. Why anyone should care is above their pay grade. When I’m a reviewer, I get papers with this mistake in them all the time. (It does not go well for the authors’ major professors.)
To avoid the seductions of databases, the sequence ought to go:
- Thing z is important, and readers will understand it better if they know y,
- Thing y is a function of x, which is accessible through means I’m good at,
- So I’m studying x, and here’s what I found.
I don’t obey this structure with perfect fidelity. This post and this one are pretty much of the form, “I’ve got a database and nobody can stop me from using it.” That’s OK for a blog (in moderation) because this is a place for scintillating insights, wild-goose chases, and things that turn out to be dumb, without discrimination on the basis of merit. But mostly I’ve stuck to my preferred structure. And if the rest of the world doesn’t come along with me, well, let a hundred flowers bloom; our papers won’t all sound the same.