Why Vegetarians Miss Fewer Flights: Five Bizarre Insights from Data – The Dr. Data Show

♪ I love it when you call me Big Data ♪

Welcome to the Dr. Data Show! I’m Eric Siegel. It’s weird, vegetarians miss fewer flights. It’s wacky, people who like curly fries on Facebook are more intelligent. You’re traveling through a dimension whose boundaries are that of imagination. Well, it’s not really The Twilight Zone, but let’s call it the Freakonomics of big data, or the Ripley’s Believe It or Not of data science.

We live in a weird and wacky world, full of bizarre and surprising connections, and these connections are reflected within all those tons of data constantly being collected. That’s what makes data the world’s most potent, flourishing unnatural resource. And so the world has enthusiastically dived into a golden age of data discoveries. A frenzy of number-crunching is churning out heaps of insights that are colorful, sometimes surprising, and often valuable.

So, I’m gonna explain how this works, and investigate five bizarre
discoveries found in data.

Now, the whole point of computers is to have things done automatically, so why not automate scientific discovery? Well, that’s, in fact, often what’s achieved by machine learning, when computers learn from the experience encoded in data. You can think of the learning process in two phases. Number one, find individual insights like the one about vegetarians, and number two, combine the insights together to improve their capacity to predict. This is also called predictive modeling, since you’re combining them together into a single model. Now, that’s actually a little simplified, ’cause machine learning often interweaves these two phases, but we can think of the first one, the hunt for insights, as a foundational part of the process.

Having computers make such discoveries is the very automation
of scientific research. It’s a major paradigm shift that upends the traditional scientific method, which is to form a hypothesis and then test it. For example, an airline might speculate that passengers who request a vegetarian meal end up missing their flight less often. Then they’d examine data in order to test this hypothesis. By the way, the reason why this is the case for vegetarians is a separate question, which I’ll address in a couple minutes. The first question is simply whether or not this little theory holds true.

But, you know, there are so many trends like that you could check for. How about passengers who prefer an aisle seat rather than a window, or passengers traveling from certain cities, et cetera? Perhaps those groups are also less likely to miss their flight. We humans are pretty smart, but instead of sitting around conjecturing on such things, why not just unleash the computer to search through a whole bunch of possible trends to see which turn out to be true? Use your CPU as a hypothesis-testing machine, a robot scientist. By hunting tirelessly and robotically, no stone is left unturned.

The technical terms for doing this include feature selection, or trying each independent variable one at a time as a univariate model. But whatever you call it, this kind of search process reaches beyond the more limited range of ideas that originate from human hunches. Allow it to do its number-crunching thing and the machine will spit out valuable discoveries, which sometimes turn out to be unexpected and may seem to defy logic.

Now, before you get too
excited about a potential robot scientist, here’s an important warning. When you push the go button, cranking up the open-ended, massive search for scientific discoveries, it can backfire and get you into major trouble by drawing false conclusions. It might find apparent trends within the data that don’t actually hold true in general. The technical word for this pitfall is p-hacking.

Actually, you can never ever, ever be literally 100% conclusive about a connection found in data. There’s always at least some remote chance the data fooled you, like even if you flip a coin 100 times and it comes up heads every time, there’s still some very remote chance it’s just a normal coin and you had a crazy long streak of heads. But, with precautions that properly check the conclusions drawn from data, which apply a high standard of statistical rigor, we can reduce the odds of being misled by data down to an acceptably remote long shot.

Alright, let’s see what our
robot scientist came up with. Here are several connections found in data. Each one helps predict things, like who will miss a flight, who is more intelligent, and who will prove to be more financially creditworthy, and so these serve as foundational building blocks with which machine learning builds predictive models. I’ll first cover two typical examples that are pretty interesting and are useful in the commercial world, and then five truly kooky connections, for a total of seven.

Number one, if you buy
diapers, you’re more likely to also buy beer. A pharmacy chain found this trend among evening-time shoppers, across dozens of outlets. Some have offered as an explanation, “Daddy needs a beer.” Now, although I’ve heard people presume this classic diapers/beer data mining example is an urban legend, you can actually find the details about this and all the examples I’m covering in the notes for my book, Predictive Analytics. The notes are available for free at PredictiveNotes.com. By the way, the book itself includes a grand total of 46 bizarre discoveries, all compiled into one single table, with examples from the likes of Walmart, Shell, Microsoft, and Wikipedia.

Now on to number two. If your friend switches cell phone companies, you’re more likely to do so yourself. A major North American carrier discovered that a customer is seven times more likely to cancel if someone in the person’s calling network cancels. Birds of a feather really do flock together.

Number three, vegetarians
miss fewer flights. Well, to be a bit more precise, an airline found that passengers who’ve preordered vegetarian meals are less likely to miss their flight. Why is that so? I don’t know. To be completely honest, the title of this episode of The Dr. Data Show, Why Vegetarians Miss Fewer Flights, has a slight case of the clickbaits, ’cause the fact is, nobody knows for sure. The researchers have suggested it could be because the knowledge of a personalized or specific meal awaiting the customer establishes a sense of commitment. But we actually can’t conclusively answer the why for most of these insights or discoveries, at least not without further research.

That’s what’s meant by the often-heard adage, correlation does not entail causation. Each of these discoveries, trends, or links in the data, whatever you wanna call them, is a correlation, and any explanation as to why they’re true would involve understanding causation. When analyzing the data that businesses accumulate, that is, typical big data, rather than analyzing data collected specifically for experiments or scientific inquiry, we often only get the what, but not the why. However, it still helps predict. An airline can still use this discovery to help calculate, for example, how overbooked a flight is likely to be, even without understanding why it’s true.

Number four, people who like curly fries on Facebook are more intelligent. Researchers at the University of Cambridge and Microsoft found
that liking curly fries is predictive of high scores on a certain test designed to measure intelligence. The researchers don’t think it’s necessarily because you have to be smart to realize how great curly fries are. Rather, they suspect that some intelligent person was the first to like the curly fries Facebook page, and his or her friends saw it, and thus seeing and liking the page spread through a network of relatively smart people.

By the way, regarding the more intelligent part, I’m personally pretty skeptical about the paring down of human intelligence to a single number, but so-called intelligence tests probably do measure how good you are at some valuable group of skills that cover some, but certainly not all, of what we mean by intelligence. Just saying.

Number five, men who skip
breakfast have a greater risk of coronary heart disease. Harvard University medical researchers found that men 45 to 82 who skip breakfast have a 27% higher risk. This isn’t necessarily because of any direct health effects of the meal itself, but rather because eating breakfast may be a proxy for lifestyle. People living a high-paced, more stressful life are more likely to skip breakfast and are also subjected to a higher health risk. Again, that’s largely just an intuitive hunch. As usual, there could also be other plausible explanations.

Number six, neighborhoods in San Francisco that exhibit higher rates of certain crime also have a higher demand for Uber rides. This is not necessarily
because criminals are taking Ubers, but rather, as Uber has postulated, because the crime is a proxy for non-residential population. Those happen to also be the areas where there are more people who don’t actually live there.

And finally, number seven,
people who go to bars are a higher credit risk. Canadian Tire issues credit cards and looked at how their customers use their card relative to their on-time bill payments. It turns out that people observed spending at a drinking establishment are more likely than average to miss repeated credit card bill payments. However, if you spend with your card at the dentist, you’re a lower credit risk, less likely to miss bill payments. And if you buy those little felt pads that keep the legs of your chair from scratching the floor, lower credit risk. You’re a more reliable bill payer. And in a related story, typing with proper capitalization corresponds with creditworthiness, according to a financial services startup company that analyzed how online loan applications were filled out.

Now, the reasoning behind these trends may seem intuitive and self-evident, as far as the kinds of personalities and how organized people are, but, again, remember to take such interpretations and hunches with a huge grain of salt. This freak show of surprising discoveries delivers predictive value, but does little to explain itself, scientifically speaking.

I’m Eric Siegel. Thanks for watching. Hit like and share this video
if you think your friends would also be interested in vegetarians, curly fries, and the automation of scientific research. And for access to the entire web series, go to TheDoctorDataShow.com.

♪ Who’s your data ♪
♪ Provide me the data to improve ♪
♪ And I’ll apply the computation ♪
♪ I love it when you call me Big Data ♪
♪ Predictive analytics can help you with decisions ♪
♪ You can call, mail, credit, or hire with precision ♪
♪ On law, love, and life, you can prognosticate ♪
♪ Whom to investigate, incarcerate ♪
♪ Set up on a date, or medicate. ♪
♪ Charlie Brown never gets his kicks ♪
♪ That’s why every old dog needs a brand new trick ♪
♪ If you get sick of chasing sticks ♪
♪ Or clicks with just a quick fix ♪
♪ You need to learn to predict ♪
♪ I can predict your every move ♪
♪ Just gimme all your information, yeah ♪
♪ Who’s your data ♪
♪ Provide me the data to improve ♪
♪ And I’ll apply the computation ♪
♪ I love it when you call me Big Data ♪