For data scientists and marketers, big data gives us access to an unprecedented amount of information on consumers. The data reveals a deep level of insight, creating a statistical Holy Grail, but it also introduces a layer of potential pitfalls.
As people spend more and more of their time with their faces buried in their mobile devices or on social platforms, they are creating more and more data that can be mined, analyzed, and interpreted. From cell phone GPS signals and credit card transactions to digital photographs and social media posts, these interactions create some 2.5 quintillion bytes of consumer data per day at last count. Data at this level of detail and volume lets us measure what was previously unmeasurable and enables today’s world of ensemble modeling.
But in the era of big data, marketers, advertisers, and data scientists are often at the mercy of algorithms that decide when, how, and what data our audiences receive. We’re not marketing to people anymore; we’re marketing to the machines that market to people. Machines have become a proxy for our audiences.
The big data bias
For those of us tasked with trying to make sense of data, either from a research standpoint or for product development purposes, this has serious implications. Our growing understanding of the biases that are hard-coded into the platforms means a large portion of the data we prefer to use could be b.s. There's also the risk that, from an applied standpoint, we're basing our predictions and insights on what a machine thinks humans like, inferred from the clicks, likes, and views it's designed to optimize against. So it's not just that social media, for example, creates echo chambers. It's that the data from these limited, biased, and manipulated echo chambers is being used to build models that are supposed to reflect our diverse and often unpredictable human audiences.
“By optimizing content for social networks, we risk building models to beat algorithms—not to understand users and delight fans.”
Consider social media, which is used by nearly 70% of Americans. While social platforms connect people, that’s not really their primary function. (If it were, the feeds would list every post in sequence and users would decide what to spend time on.) Instead, these social platforms use algorithms to prioritize content that encourages people to like or buy something. This is a super important distinction for us data wonks. It means that by optimizing content or campaigns for these social networks, we risk building models to beat algorithms—not to understand users and delight fans.
In essence, social platform data is a self-fulfilling prophecy. Under the auspices of promoting content, social media platforms present us with limited options from a small subset of our network in a self-supported feedback loop. In anecdotal audits of my own Facebook feed, I see posts from about 30 friends (or about 5% of my network), covering mainly political/social activism, baby pictures, and memes. I know that if I were to like or comment on these types of posts from any of those 30 people, and if this is all they post, then I will continue to see mostly this sort of content. And if our teams at Viacom were to ingest my Facebook data, this would be the picture of me—which I can attest is not the whole me.
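That feedback loop can be sketched in a few lines of code. This is a toy model, not any platform's actual ranking system: the topics, scores, and feed size are all invented for illustration. A user with broad, even interests is shown only the topics with the best engagement history, and whatever happens to rank first early on absorbs nearly all future engagement.

```python
import random

random.seed(7)

# Hypothetical user with broad, even interest across many topics
topics = ["politics", "babies", "memes", "travel", "books", "music", "sports", "food"]
interest = {t: 0.5 for t in topics}        # the "whole me": mild interest in everything
engagement = {t: 1.0 for t in topics}      # the platform's running score per topic

FEED_SIZE = 3
for _ in range(200):
    # The feed shows only the topics with the highest accumulated engagement
    feed = sorted(topics, key=lambda t: engagement[t], reverse=True)[:FEED_SIZE]
    for t in feed:
        if random.random() < interest[t]:  # the user likes about half of what appears
            engagement[t] += 1             # ...and the platform doubles down on it

shown = sorted(topics, key=lambda t: engagement[t], reverse=True)[:FEED_SIZE]
share = sum(engagement[t] for t in shown) / sum(engagement.values())
print("topics that dominate the feed:", shown)
print(f"engagement share of top {FEED_SIZE} topics: {share:.0%}")
```

The user never stopped caring about the other five topics; the platform simply never gave them another chance to register. A model trained on this engagement log would describe a much narrower person than the one who exists.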
3 risks of using social media in data-driven marketing:
- Clicks lack context.
Platforms favor content that makes money…for the platforms. When I open Facebook, the first thing I see is an ad for one of those home-delivery shaving companies. If a data scientist were to see a like or click from me, they would say, "Kodi must like home-delivery shaving companies." But this kind of faulty logic strips vital context from the analysis. Even if the ad is relevant, by clicking on it, I'm not indicating this was the "best" possible piece of content/ad to show me. It just means this ad beat out other, similar ads with similar-size budgets for the chance to show up in my feed at that point in time.
- Bots are ubiquitous.
A recent research study by the University of Southern California concluded that bots make up about 400,000 accounts on Twitter and generate somewhere close to 3.8 million tweets. Twitter has specific algorithms to filter what users see, but also a huge bot problem, with an estimated ~15% of its accounts not belonging to actual humans. The human-to-bot ratio makes it difficult to draw any clear conclusions about sentiment or support.
- Dominant groups can distract.
Trolls and hyper-users dominate the content on structured social networks. Think about it: how much of Twitter is Donald Trump 24/7, or how much time is spent watching J.K. Rowling go rounds with a troll over her political comments? This is the stuff we see because it generates replies and retweets. We know, as users, that this content isn't representative of the entire breadth of human conversation and interests. Yet, as marketers, we use this data to define audiences and even entire generations.
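The "clicks lack context" risk above is really a selection-bias problem, and a tiny simulation makes it concrete. Everything here is invented for illustration: the two ad names, the click probabilities, and the assumption that budget alone decides the auction. The point is that a click log only contains the ads that won the auction, so a naive CTR analysis can never surface the ad the user would actually have preferred.

```python
import random

random.seed(42)

# Hypothetical user preferences the platform never fully observes:
# this user would click the bookstore ad twice as often as the shaving ad.
true_click_prob = {"shaving_club": 0.30, "bookstore": 0.60}

# But the higher-budget ad wins every auction, so the log is one-sided.
served_log = []
for _ in range(1000):
    ad = "shaving_club"                        # wins on budget, every time
    clicked = random.random() < true_click_prob[ad]
    served_log.append((ad, clicked))

# A naive analyst computes CTR per ad from the log...
impressions, clicks = {}, {}
for ad, clicked in served_log:
    impressions[ad] = impressions.get(ad, 0) + 1
    clicks[ad] = clicks.get(ad, 0) + clicked

observed_ctr = {ad: clicks[ad] / impressions[ad] for ad in impressions}
print("observed CTRs:", observed_ctr)
# The "winning" ad in the data is just the only ad that had a chance to be
# seen; the ad the user genuinely prefers never appears in the log at all.
```

The observed CTR for the shaving ad is accurate as far as it goes, but it answers "how did the served ad perform?", not "what would this person most like to see?"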
When social media becomes a social contagion
Machine learning has basically created different versions of the world. This is a fly in the ointment for data. It makes it difficult for those of us analyzing data to normalize our models to reflect a consistent reality. Not only is there machine bias, but there's also the widening variance of what we collectively perceive as reality. How do we build models that predict against infinite versions of reality trained on information that isn't even true?
There’s a clue in the research of academics, who have been studying how information is disseminated and discovered across social media. They’ve found there are similarities between how viral health epidemics and social media epidemics spread, with an interesting difference: social epidemics spread faster because they disperse within structured social networks.
Researchers led by Kristina Lerman, a project leader at University of Southern California's Information Sciences Institute, found that within these social communities, a few well-connected sources can skew entire online communities thanks to what they've dubbed the majority illusion. Essentially, users overestimate the prevalence of attributes in a population because they encounter them repeatedly within their own small networks. So, "even a minority opinion can appear to be extremely popular," according to the research paper. Soon enough, there's no global warming, vaccinations don't work, and yes, clowns are trying to kidnap your children. Sprinkle in some cognitive heuristics and you've embarked on a war on reality.
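The majority illusion is easy to reproduce on a toy network. The graph below is contrived for illustration (it is not the structure from Lerman's paper): two well-connected hub accounts hold a rare opinion, and twenty ordinary accounts each follow both hubs plus one peer. Globally the opinion is held by under 10% of accounts, yet every ordinary account sees it held by a majority of its neighbors.

```python
from collections import defaultdict

# Toy network: two hub accounts hold a rare opinion; nobody else does.
hubs = [0, 1]
others = list(range(2, 22))              # 20 ordinary accounts
edges = [(h, o) for h in hubs for o in others]   # hubs connect to everyone
edges += [(i, i + 1) for i in range(2, 21, 2)]   # sparse ties among the rest

holds_opinion = {n: (n in hubs) for n in range(22)}

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Global prevalence of the opinion vs. the average share each ordinary
# account sees among its own neighbors
global_share = sum(holds_opinion.values()) / len(holds_opinion)
local_shares = [
    sum(holds_opinion[m] for m in neighbors[n]) / len(neighbors[n])
    for n in others
]
perceived = sum(local_shares) / len(local_shares)

print(f"global share: {global_share:.0%}")      # prints "global share: 9%"
print(f"perceived locally: {perceived:.0%}")    # prints "perceived locally: 67%"
```

Each ordinary account has three neighbors, two of which are opinion-holding hubs, so a 9% opinion looks like a 67% opinion from every seat in the audience. Scale that up and a handful of hyper-connected accounts can make a fringe view feel like consensus.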
Even Mark Zuckerberg, in his 6,000-word mea culpa, touched on how social platforms can take social contagions airborne without a means to stop them, and on how the fragmenting of our collective human understanding is one of the biggest concerns of the Information Age.
Adopting a mindset for solutions
So should we, as data jockeys, attempt to stop the distortion of reality? If yes, whose responsibility is it, and how do they do it? TV traditionally played this role because the medium is less of a structured network and allows for the connection of weak ties, which, in turn, prevents social contagions because the people we don't really know act as a cross-validating mechanism. Is it time for TV to take up the mantle again? Do we need the ghost of Walter Cronkite to pull us back together?
Or, as marketers, do we attempt to beat the machines' ranking algorithms using their biased data and engagement metrics? Ultimately, should it be our responsibility to have our models represent reality?
I will say, on my team, we’ve changed how we approach the weight of social media data. We’re still looking at big data as an input to power our machine-learning techniques. But we’ve also returned to using a lot of survey, credit card, viewing, and retail data because it’s cleaner and more deterministic.
Ultimately, there’s no “answer” or silver bullet. As social platforms evolve—as well as how they’re used—we have to constantly find new ways to understand what we are really seeing from the ever-growing reams of available data. It’s a necessary and ongoing process, and part of our responsibility to find ethical models that are best for our businesses and clients.