May 01 2014

Big Data Is Hard To Define… and Vulnerable

Categories: Rants Dave Rathbun @ 12:57 pm

Stephen Few weighed in yesterday on the proper definition of big data, and it’s an interesting read. If you don’t want to click through, I will summarize in one sentence: “Big data is nothing special; it’s just data.” Obviously Stephen’s opinions have not stopped (and won’t stop) people from using the term.

Next up on my blog reading list this morning was a trip to FiveThirtyEight.com. The headline article was titled “The Story Behind the Worst Movie on IMDb.” I’m guessing that IMDb doesn’t really qualify as “big data,” as they have “only” 2.8 million titles in their database. 🙂 But the story wasn’t about big data; it was about the worst movie in the database as determined by public ratings. I would have expected the soundly panned “Battlefield Earth” (which is one of the worst, with an overall rating of 2.4), the unfortunate Halle Berry stinker “Catwoman” (3.3), or perhaps even the Paris Hilton vehicle “The Hottie and the Nottie” (which I’m somewhat ashamed to admit I even knew about, and which brings in a lowly rating of 1.8).

It turns out the worst-rated movie was not any of these, but instead a Bollywood production called “Gunday,” which has a rating of 1.4. Over 91% of the posted ratings are one star! What happened? Was the movie really that terrible?

For the full story, click through to the article on FiveThirtyEight. In summary: an entire country decided they didn’t like the movie and decided to do something about it.

…the movement has since become an online alliance of bloggers focused on protecting Bangladesh’s history and promoting the country’s image. That includes protesting “Gunday,” because of the film’s reference to the Bangladesh Liberation War as the Indo-Pak war. In its first 11 minutes, the movie claims that India alone defeated Pakistan, and implies that an independent Bangladesh was simply a result of the fight.

What happens when an entire country decides that a movie is bad? The movie becomes perceived as historically bad. More from the article:

For Paris Hilton’s “The Hottie & the Nottie” — currently rated second-worst of all time — to take over IMDb’s bottom spot, the next 41,000 voters would have to give it a 1.

Last year I wrote a blog post titled “Is External Data Always Good?” This is one more example of how social media and crowd-sourced data can be skewed by a concentrated effort. Is “Gunday” really the worst movie of all time? Probably not. Most professional critics were not nearly as harsh, especially compared with how they treated Paris Hilton’s effort. One user reviewed Paris’s acting by saying, “Paris Hilton’s acting made me lose braincells.” The people rating “Gunday” on IMDb were not spammers; they were unique individuals. They just happened to be part of a focused effort to trash a movie they perceived as historically inaccurate. (Please note: I am not making any assessment as to the accuracy of the film. I am far from an expert in that area, so I’m neither endorsing nor rejecting the movie.)
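To get a feel for the arithmetic behind a figure like the 41,000 votes quoted above, here is a minimal Python sketch. It makes two loud assumptions: it treats a rating as a plain arithmetic mean (IMDb publishes a weighted average, so this is not their method), and the vote counts are hypothetical numbers I made up for illustration. The function names are mine as well.

```python
from fractions import Fraction
import math


def mean_after_block(current_mean, current_votes, block_votes, block_score=1):
    """Plain mean after block_votes extra ratings of block_score are added.

    Assumption: a simple average, not IMDb's weighted rating.
    """
    total = Fraction(current_mean) * current_votes + Fraction(block_score) * block_votes
    return total / (current_votes + block_votes)


def one_star_votes_needed(current_mean, current_votes, target_mean):
    """Smallest number of extra 1-star votes that pulls a plain mean down to target_mean."""
    current, target = Fraction(current_mean), Fraction(target_mean)
    if not 1 < target < current:
        raise ValueError("need 1 < target_mean < current_mean")
    # Solve (current * n + k) / (n + k) <= target for k, where n = current_votes.
    return math.ceil(current_votes * (current - target) / (target - 1))


# Hypothetical title: 5,000 votes averaging 6.0, then 20,000 coordinated 1-star votes.
print(float(mean_after_block("6.0", 5000, 20000)))

# Hypothetical title rated 1.8 on 10,000 votes: how many 1s to reach 1.4?
print(one_star_votes_needed("1.8", 10000, "1.4"))
```

In the first made-up case a perfectly respectable 6.0 drops to 2.0, even though nobody who originally voted changed their mind. Ratings are passed as strings so that `Fraction` keeps the decimals exact; with ordinary floats the rounding can push the answer off by one vote.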

Ultimately, I think the article from FiveThirtyEight wraps it up best.

Crowdsourcing can be a tremendously powerful way to get a consensus understanding of the world. Because the sample size is so large, there’s an assumption that whatever it yields is robust and true. But even with oversight, aggregated rankings are subject to unforeseen biases. Crowds are always big — but they’re not always wise. Sometimes it’s impossible to control which crowds are being sourced.

Big data is just data. But you still have to understand where it’s coming from in order to benefit from it.