8.3. Mining Social Media Data
Data mining is the process of taking a set of data and trying to learn new things from it.
For example, social media data about who you are friends with might be used to infer your sexual orientation [h8]. Social media data might also be used to infer people’s:
Race
Political leanings
Interests
Susceptibility to financial scams
Proneness to addiction (e.g., gambling)
Additionally, groups keep trying to re-invent old debunked pseudo-scientific (and racist) methods of judging people based on facial features (size of nose, chin, forehead, etc.), but now using artificial intelligence [h9].
Social media data can also be used to infer information about larger social trends like the spread of misinformation [h10].
One particularly striking example of an attempt to infer information from seemingly unconnected data came when someone noticed that the number of people sick with COVID-19 correlated with the number of people leaving bad reviews of Yankee Candles saying “they don’t have any scent” (note: COVID-19 can cause a loss of the ability to smell).
8.3.1. Spurious Correlations
One thing to note about the candle reviews and COVID-19 example is that just because two things appear to be correlated doesn’t mean they are connected in the way they seem to be. Here, the correlation might be due mostly to people buying and reviewing candles in the fall, and diseases like COVID-19 spreading most during the fall.
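To make the idea of a shared seasonal cause concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the names `no_scent_reviews` and `illness_cases`, the ten years of monthly data, and all the numbers are assumptions, not real review or case counts. Each series is built from the same yearly cycle plus its own independent noise, so neither one influences the other, yet they still come out strongly correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def deseasonalize(values):
    """Subtract each calendar month's average to remove the yearly cycle."""
    monthly_mean = [sum(values[k::12]) / len(values[k::12]) for k in range(12)]
    return [v - monthly_mean[i % 12] for i, v in enumerate(values)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical yearly cycle over 120 months (ten made-up years).
season = [math.cos(2 * math.pi * (m % 12) / 12) for m in range(120)]

# Both made-up series depend on the season, but not on each other.
no_scent_reviews = [10 + 5 * s + rng.gauss(0, 1) for s in season]
illness_cases = [200 + 80 * s + rng.gauss(0, 20) for s in season]

raw_r = pearson(no_scent_reviews, illness_cases)
adjusted_r = pearson(deseasonalize(no_scent_reviews),
                     deseasonalize(illness_cases))

print(f"Correlation in the raw data:            {raw_r:+.2f}")
print(f"Correlation after removing seasonality: {adjusted_r:+.2f}")
```

In this made-up setup, the strong correlation in the raw data mostly disappears once the seasonal pattern is subtracted out. That is the pattern you would expect when a shared cause, rather than a direct link, is producing the correlation. Real data is messier, and a real effect can exist alongside a seasonal one.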
It turns out that if you look at a lot of data, it is easy to discover spurious correlations [h12], where two things look like they are related but actually aren’t. Instead, the appearance of being related may be due to chance or some other cause.
By looking at enough data in enough different ways, you can find evidence for pretty much any conclusion you want. This is because sometimes different pieces of data line up coincidentally (coincidences happen), and if you try enough combinations, you can find the coincidence that lines up with your conclusion.
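The effect of trying many combinations can be sketched with purely random numbers. The code below is only an illustration under stated assumptions: the function name `strongest_chance_correlation`, the series length of 10, and the numbers of series tried are all hypothetical choices. It creates one random “target” series, then compares it against more and more unrelated random series, keeping the strongest correlation it finds.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(num_series, length, seed=0):
    """Return the strongest correlation between one random target series
    and num_series other random series that have nothing to do with it."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(length)]
    best = 0.0
    for _ in range(num_series):
        candidate = [rng.gauss(0, 1) for _ in range(length)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


for num_series in (1, 10, 100, 1000, 10000):
    r = strongest_chance_correlation(num_series, length=10)
    print(f"{num_series:>5} random series tried: strongest correlation = {r:+.2f}")
```

None of these series has any connection to the target, so any strong correlation that shows up is a coincidence. The more series you try, the stronger the best coincidence tends to be, which is why reporting only the best match out of many attempts can be so misleading.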
If you want to explore the difficulty of inferring trends from data, the website fivethirtyeight.com [h14] has an interactive feature called “Hack Your Way To Scientific Glory” [h15] where, by changing how you measure the US economy and how you measure which political party is in power in the US, you can “prove” that either Democrats or Republicans are better for the economy. FiveThirtyEight also has a longer article on this called “Science Isn’t Broken: It’s just a hell of a lot harder than we give it credit for.” [h16]