There’s a great deal I know only a little about, still more about which I know nothing, and a terrifyingly small number of things I know quite a lot about. [/Donald Rumsfeld]
But one of the things in that last category is statistical sampling, because my very first real job was in the Statistics Department of what was then the largest marketing research company in the world (the Great Big Research Company, or A.C. Nielsen). My specific area of expertise was sample selection: the methodology of creating a sample whose data would accurately represent reality. A single anecdote will suffice.
One of our major clients was the yogurt-producing subsidiary of a large dairy corporation (think: Yoplait). This company was forever questioning our data, because in some geographical areas we showed their market share as too small (the sales numbers didn’t jibe with their actual deliveries to stores, which were a known quantity), while in other areas, paradoxically, we showed it as far too large, and for exactly the same underlying reason.
My job was to investigate this phenomenon, and some months later I discovered the reason. The various smaller dairies’ yogurts were not being delivered to all the stores in the area, but in the stores where they did have fridge space, they sold extremely well. A simple picture shows the problem:
Our sample of stores may have been representative of, say, total grocery sales in the area (and it was), but when yogurt sales were carved out, the sample simply sucked, because of how the dairies’ distribution worked.
It’s a very complex problem, and it applies to just about any sample selection. In this case, there was no solution other than to broaden the sample, which would have cost too much. So unless the client was prepared to pay a much higher fee to get better data, they’d either have to live with suspect data or cancel their account altogether. (The end result was that they stopped looking at specific markets, and only bought data at the national level, which was acceptably accurate, but less useful to the local sales teams.)
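For the statistically curious, here’s a minimal simulation of the mechanism (every number in it is hypothetical; this is a sketch of the principle, not Nielsen’s actual methodology): a store panel can track total grocery volume faithfully while its estimate of a carved-out category swings wildly, purely because the category’s distribution is spotty.

```python
import random

random.seed(42)

# Hypothetical universe of 1,000 stores. Every store sells groceries,
# but the smaller dairies' yogurt sits in only 20% of them -- and where
# it does get fridge space, it sells very well.
stores = []
for _ in range(1000):
    grocery = random.uniform(80_000, 120_000)    # weekly grocery sales, $
    stocked = random.random() < 0.20             # small-dairy yogurt on the shelf?
    small_dairy = random.uniform(2_000, 4_000) if stocked else 0.0
    client = random.uniform(500, 1_500)          # the client's brand, carried everywhere
    stores.append((grocery, client, small_dairy))

def client_share(panel):
    """The client's share of total yogurt sales across a set of stores."""
    ours = sum(s[1] for s in panel)
    theirs = sum(s[2] for s in panel)
    return ours / (ours + theirs)

# Draw 200 different 50-store panels, as if reporting on 200 local markets.
panels = [random.sample(stores, 50) for _ in range(200)]
grocery_avgs = [sum(s[0] for s in p) / len(p) for p in panels]
share_ests = [client_share(p) for p in panels]

print(f"Avg grocery sales, all stores:  ${sum(s[0] for s in stores) / len(stores):,.0f}")
print(f"Panel grocery averages:         ${min(grocery_avgs):,.0f} to ${max(grocery_avgs):,.0f}")
print(f"True client yogurt share:       {client_share(stores):.1%}")
print(f"Panel yogurt share estimates:   {min(share_ests):.1%} to {max(share_ests):.1%}")
```

The grocery averages barely move from panel to panel, but the yogurt share swings widely depending on how many small-dairy stores happen to land in each panel: too small in one market, far too large in the next, exactly the complaint we kept fielding.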
I told you all that so I could talk about this.
Harris’ So-Called ‘Surge’ Is Thanks To Oversampling: Pollsters
In the meta data from the call centers college educated Dems are 3-4x more likely to answer than non-college. While weighting can help minimize the bias if done correctly it won’t totally eliminate the problem.
— Mark Davin Harris (@markdharris) August 16, 2024
Critics point out that many polls have been sampling a disproportionately smaller share of Republican voters compared to exit poll data from the 2020 presidential election. The result, they say, is a misleading “phantom advantage” for Ms. Harris. According to them, this skewed sampling could be a strategic move to boost enthusiasm and fundraising for Ms. Harris’ campaign.
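The tweet above makes the key technical point, and it’s worth seeing why weighting helps but can’t fully rescue a skewed respondent pool. Here’s a small sketch (hypothetical numbers throughout, not any pollster’s actual model): post-stratification weighting repairs the education mix of the sample, but if willingness to pick up the phone also correlates with candidate enthusiasm within each education group, a residual bias survives the reweighting.

```python
import random

random.seed(1)

# Hypothetical electorate: 40% college-educated, 60% non-college,
# with different (hypothetical) support rates for Candidate A.
population = []
for _ in range(100_000):
    college = random.random() < 0.40
    supports = random.random() < (0.60 if college else 0.45)
    population.append((college, supports))

def answers_phone(college, supports):
    p = 0.12 if college else 0.04   # a 3x response gap by education
    if supports:
        p *= 1.5                    # enthusiasm also boosts answering --
    return random.random() < p      # the part weighting can't see

respondents = [(c, s) for c, s in population if answers_phone(c, s)]

true_support = sum(s for _, s in population) / len(population)
raw_support = sum(s for _, s in respondents) / len(respondents)

# Post-stratification: reweight respondents so the education mix
# matches the population (40/60).
pop_share = {True: 0.40, False: 0.60}
n = len(respondents)
resp_share = {g: sum(1 for c, _ in respondents if c == g) / n for g in (True, False)}
weight = {g: pop_share[g] / resp_share[g] for g in (True, False)}

weighted_support = (sum(weight[c] * s for c, s in respondents)
                    / sum(weight[c] for c, _ in respondents))

print(f"True support:               {true_support:.1%}")
print(f"Raw respondent support:     {raw_support:.1%}")       # both skews inflate it
print(f"After education weighting:  {weighted_support:.1%}")  # closer, but still inflated
```

With these made-up numbers the raw sample overstates support by double digits; weighting claws back a chunk of that, but the enthusiasm skew inside each education cell remains untouched, which is exactly what “it won’t totally eliminate the problem” means.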
Usually, when I talk about situations like this, I use a shorthand expression like: “They must have drawn their sample from the Harvard Faculty Lounge.”
Unscrupulous polling companies can (and do) draw their samples to show exactly what the clients want to see, tailoring the sample to produce the desired result. We used to call this the “K factor”: that number which, when applied to the data, will provide the result most favorable to the client. It’s more commonly known within the research community as “bullshit”, but it’s bullshit that will generate headlines, so ten guesses as to whether the mainstream media will accept such data uncritically, either because it favors their own bias/opinion or because they are completely incapable of analyzing the data properly. (If you answered “or both” to the above, go to the head of the class.)
So is the “Kamala Surge” real, or not? Given all the players in this particular piece of theater… oh please, it’s patent bullshit.