Adventures in Sparkland (or… How I Learned that Michael Caine was the original Jason Bourne)

Ready, set, revive data blog! What better way to take advantage of the sketchy wifi I’ve encountered along my travels through South America than to do some data science?

For some time now, I’ve wanted to get my feet wet with Apache Spark, the open source software that has become a standard tool on the data scientist’s utility belt when it comes to dealing with “big data.” Specifically, I was curious how Spark can understand complex human-generated text (through topic or theme modeling), as well as its ability to make recommendations based on preferences we’ve expressed in the past (i.e. how Netflix decides what to suggest you should watch next). For this, it only seemed natural to focus my energies on something I am also quite passionate about: Movies!

Many people have already used the well-known and publicly available Movielens dataset (README, data) to test out recommendation engines. To add my own twist on standard practice, I added a topic model based off of movie plot data that I scraped from Wikipedia. This blog post will go into detail about the whole process. It’s organized into the following sections:

- Setting Up The Environment
- #ScrapeMyPlot
- Model Dem Topics
- Rev Your Recommendation Engines
- Results/Findings

Setting Up The Environment

To me, this is always the most boring part of doing a data project. Unfortunately, this yak-shaving is wholly necessary to ever do anything interesting. If you only came to read about how this all relates to movies, feel free to skip over this part…

I won’t go into huge depth here, but I will say I effin love Docker as a means to set up my environment. The reason Docker is so great is that it makes a dev environment totally explicit and portable—which means anybody who’s actually interested in the gory details can go wild with them on my Github (and develop my project further, if they so please).

Another reason Docker is the awesomest is that it made the process of simulating a cluster on my little Macbook Air relatively straightforward. Spark might be meant to be run on a cluster of multiple computers, but being on a backpacker’s budget, I wasn’t keen on commandeering a crowd of cloud computers using Amazon Web Services. I wanted to see what I could do with what I had.

The flip side of this, of course, is that everything was constrained to my 5-year-old laptop’s single processor and the 4GB of RAM I could spare to be shared by the entire virtual cluster. I didn’t think this would be a problem since I wasn’t dealing with big data, but I did keep running up against some annoying memory issues that proved to be a pain. More about that later.

#ScrapeMyPlot

The first major step in my project was getting ahold of movie plot data for each of the titles in the Movielens dataset. For this, I wrote a scraper in Python using this handy wikipedia Python library I found. The main idea behind my simple program was to: 1) search Wikipedia using the title of each movie, 2) use category tags to determine which search result was the article about the actual film in question, and 3) use Python’s BeautifulSoup and Wikipedia’s generally consistent HTML structure to extract the “plot” section from each article.
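
Here’s a rough, minimal sketch of those three steps, assuming the wikipedia package and BeautifulSoup. The fetch_plot helper, the id="Plot" lookup, and the simple category check are illustrative simplifications rather than my exact code:

```python
import wikipedia
from bs4 import BeautifulSoup

def fetch_plot(title):
    """Search Wikipedia for a movie title and try to return its 'Plot' section."""
    for result in wikipedia.search(title):
        page = wikipedia.page(result, auto_suggest=False)
        # Step 2: use category tags to check this is actually a film article
        if not any("film" in cat.lower() for cat in page.categories):
            continue
        # Step 3: rely on Wikipedia's fairly consistent HTML to pull out the Plot section
        soup = BeautifulSoup(page.html(), "html.parser")
        plot_anchor = soup.find(id="Plot")
        if plot_anchor is None:
            continue
        paragraphs = []
        for sibling in plot_anchor.find_parent(["h2", "h3"]).find_next_siblings():
            if sibling.name in ("h2", "h3"):  # stop at the next section heading
                break
            if sibling.name == "p":
                paragraphs.append(sibling.get_text())
        return "\n".join(paragraphs)
    return None
```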

I wrapped these three steps in a bash script that would keep pinging Wikipedia until it had attempted to grab plots for all the films in the Movielens data. This was something I could let run overnight or while trying to learn to dance like these people (SPOILER ALERT: I still can’t).

The results of this automated strategy were fair overall. Out of the 3,883 movie titles in the Movielens data, I was able to extract plot information for 2,533 or roughly 2/3 of them. I was hoping for ≥ 80%, but what I got was definitely enough to get started.

As I would later find however, even what I was able to grab was sometimes of dubious quality. For example, when the scraper was meant to grab the plot for Kids, the risqué 90’s drama about sex/drug-fueled teens in New York City, it grabbed the plot for Spy Kids instead. Not. the. same. Or when it was meant to grab the plot for Wild Things, another risqué 90’s title (but otherwise great connector in the Kevin Bacon game), it grabbed the plot for Where The Wild Things Are. Again, not. the. same. When these movies popped up in the context of trying to find titles that are similar to Toy Story, it was definitely enough to raise an eyebrow…

All this points to the importance of eating your own dog food when it comes to working with new, previously un-vetted data. Yes, it is a time consuming process, but it’s very necessary (and at least for this movie project, mildly entertaining).

Model Dem Topics

So first, one might ask: why go through the trouble of using a topic model to describe movie plot data? Well for one thing, it’s kinda interesting to see how a computer would understand movie plots and relate them to one another using probability-based artificial intelligence. But topic models offer practical benefits as well.

For starters, absent a topic model, a computer generally represents a plot summary (or any document, for that matter) as a bag of the words contained in that summary. That can be a lot of words, especially because the computer has to keep track not just of the words in a single movie’s summary, but of the union of all the words across all the summaries of all the movies in the whole dataset.

Topic models reduce the complexity of representing a plot summary from a whole bag of words to a much smaller set of topics. This makes storing information about movies much more efficient in a computer’s memory. It also significantly speeds up calculations you might want to perform, such as seeing how similar one movie plot is to another. And finally, using a topic model can potentially help the computer describe the similarities between movies in a more sensible way. This increased accuracy can be used to improve the performance of other models, such as a recommendation engine.

Spark learns the topics across a set of plot summaries using a probabilistic process known as Latent Dirichlet Allocation or LDA. I won’t describe how LDA works in great depth (look here if you are interested in learning more), but after analyzing all the movie plots, it spits out a set of topics, i.e. lists of words that are supposed to be thematically related to each other if the algorithm did its job right. Each word within each topic has a weight proportional to its importance within the topic; words can repeat across topics but their weights will differ.
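
For reference, here’s a minimal sketch of what fitting such a model looks like with Spark’s DataFrame-based LDA. The column names, vocabulary size, and iteration count are illustrative rather than my exact settings, and plots_df is assumed to already hold one row of pre-processed tokens per movie:

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# Turn each movie's token list into a sparse term-count vector
cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=5000)
cv_model = cv.fit(plots_df)
vectorized = cv_model.transform(plots_df)

# Fit LDA with a fixed number of topics (k must be chosen up front; more on that below)
lda = LDA(k=16, maxIter=50, featuresCol="features")
lda_model = lda.fit(vectorized)

# Each topic is a ranked list of term indices plus weights;
# map the indices back to words through the CountVectorizer vocabulary
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=20).collect():
    print([vocab[i] for i in row.termIndices])
```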

One somewhat annoying thing about using LDA is that you have to specify the number of topics before running the algorithm, which is an awkward thing to pinpoint a priori. How can you know exactly how many topics exist across a corpus of movies—especially without reading all of the summaries? Another wrinkle to LDA is how sensitive it can be to the degree of pre-processing performed on a text corpus before feeding it to the model.
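
For a sense of what that pre-processing looks like, here is a rough NLTK-based sketch. The noun-only part-of-speech filter is just one illustrative choice, not necessarily the exact filtering I used:

```python
import re
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))
KEEP_TAGS = ("NN", "NNS", "NNP", "NNPS")  # keep nouns only -- an illustrative choice

def preprocess(plot_text):
    """Tokenize a plot summary, filter by part of speech and stop words, then Porter-stem."""
    tokens = word_tokenize(plot_text)
    tagged = pos_tag(tokens)
    kept = [word.lower() for word, tag in tagged
            if tag in KEEP_TAGS
            and word.lower() not in stop_words
            and re.match(r"^[A-Za-z]+$", word)]
    return [stemmer.stem(word) for word in kept]
```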

After settling on 16 topics and a slew of preprocessing steps (stop word removal, Porter stemming, and part-of-speech filtering), I started to see topics that made sense. For example, there was a topic that broadly described a “Space Opera”:

Top 20 most important tokens in the “Space Opera” topic:

[ship, crew, alien, creatur, planet, space, men, group, team, time, order, board, submarin, death, plan, mission, home, survivor, offic, bodi]

Another topic seemed to be describing the quintessential sports drama. BTW, the lopped-off words like submarin or creatur are a result of Porter stemming, which reduces words to their more essential root forms.

Top 20 most important tokens in the “Sports Drama” topic:

[team, famili, game, offic, time, home, friend, player, day, father, men, man, money, polic, night, film, life, mother, car, school]

To sanity check the topic model, I was curious to see how LDA would treat films that were not used in the training of the original model. For this, I had to get some more movie plot data, which I did based on this IMDB list of top movies since 2000. The titles in the Movielens data tend to run a bit on the older side, so I knew I could find some fresh material by searching for some post-2000 titles.

To eyeball the quality of the results, I compared the topic model with the simpler “bag of words” model I mentioned earlier. For a handful of movies in the newer post-2000 set, I asked both models to return the most similar movies they could find in the original Movielens set.
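
Under the hood, both comparisons amount to the same thing: represent each movie as a vector (topic weights for LDA, term weights for bag of words) and rank the original Movielens titles by cosine similarity to the new movie. A bare-bones numpy sketch, where movielens_vecs is a hypothetical dict mapping titles to vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two movie vectors (topic weights or term weights)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def most_similar(query_vec, movielens_vecs, top_n=10):
    """Rank the original Movielens movies by similarity to a new movie's vector."""
    scores = [(title, cosine_similarity(query_vec, vec))
              for title, vec in movielens_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```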

I was encouraged (though not universally) by the results. Take, for example, the results returned for V for Vendetta and Minority Report.

Similarity Rank: V for Vendetta


| Similarity Rank | Bag of Words | Topic Model |
| --- | --- | --- |
| 1 | But I’m a Cheerleader | Candidate, The |
| 2 | Life Is Beautiful | Dersu Uzala |
| 3 | Evita | No Small Affair |
| 4 | Train of Life | Terminator 2: Judgment Day |
| 5 | Jakob the Liar | Schindler’s List |
| 6 | Halloween | Mulan |
| 7 | Halloween: H20 | Reluctant Debutante, The |
| 8 | Halloween II | All Quiet on the Western Front |
| 9 | Forever Young | Spartacus |
| 10 | Entrapment | Grand Day Out, A |

Similarity Rank: Minority Report


| Similarity Rank | Bag of Words | Topic Model |
| --- | --- | --- |
| 1 | Blind Date | Seventh Sign, The |
| 2 | Scream 3 | Crow: Salvation, The |
| 3 | Scream | Crow, The |
| 4 | Scream of Stone | Crow: City of Angels, The |
| 5 | Man of Her Dreams | Passion of Mind |
| 6 | In Dreams | Soylent Green |
| 7 | Silent Fall | Murder! |
| 8 | Eyes of Laura Mars | Hunchback of Notre Dame, The |
| 9 | Waking the Dead | Batman: Mask of the Phantasm |
| 10 | I Can’t Sleep | Phantasm |

Thematically, it seems like for these two movies, the topic model gives broadly more similar/sensible results in the top ten than the baseline “bag of words” approach. (Technical note: the “bag of words” approach I refer to is more specifically a TF-IDF transformation, a standard method used in the field of Information Retrieval and thus a reasonable baseline to use for comparison here.)
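
For completeness, here is a minimal sketch of that TF-IDF baseline in Spark, assuming the same plots_df / tokens setup as in the LDA sketch above (the feature dimension is arbitrary):

```python
from pyspark.ml.feature import HashingTF, IDF

# Hash each movie's tokens into a fixed-size term-frequency vector...
tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18)
tf_df = tf.transform(plots_df)

# ...then down-weight terms that show up in many plot summaries
idf_model = IDF(inputCol="tf", outputCol="tfidf").fit(tf_df)
tfidf_df = idf_model.transform(tf_df)
```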

Although the topic model seemed to deliver for these two films, that was not universally true. With Michael Clayton, there was no contest as to which model was better:

Similarity Rank: Michael Clayton


| Similarity Rank | Bag of Words | Topic Model |
| --- | --- | --- |
| 1 | Firm, The | Low Down Dirty Shame, A |
| 2 | Civil Action, A | Bonfire of the Vanities |
| 3 | Boiler Room | Reindeer Games |
| 4 | Maybe, Maybe Not | Raging Bull |
| 5 | Devil’s Advocate, The | Chasers |
| 6 | Devil’s Own, The | Mad City |
| 7 | Rounders | Bad Lieutenant |
| 8 | Joe’s Apartment | Killing Zoe |
| 9 | Apartment, The | Fiendish Plot of Dr. Fu Manchu, The |
| 10 | Legal Deceit | Grifters, The |

In this case, it seems the Bag of Words model picked up on the legal theme while the topic model completely missed it. In the case of The Social Network, something else curious (and bad) happened:

Similarity Rank: The Social Network


| Similarity Rank | Bag of Words | Topic Model |
| --- | --- | --- |
| 1 | Twin Dragons | Good Will Hunting |
| 2 | Higher Learning | Footloose |
| 3 | Astronaut’s Wife, The | Grease 2 |
| 4 | Substitute, The | Trial and Error |
| 5 | Twin Falls Idaho | Love and Other Catastrophes |
| 6 | Boiler Room | Blue Angel, The |
| 7 | Birdcage, The | Lured |
| 8 | Quiz Show | Birdy |
| 9 | Reality Bites | Rainmaker, The |
| 10 | Broadcast News | S.F.W. |

With Good Will Hunting—another film about a gifted youth hanging around Cambridge, Massachusetts—it seemed like the topic model was off to a good start here. But then with Footloose and Grease 2 following immediately after, things deteriorate quickly. The crappiness of both result sets speaks to the overall low quality of the data we’re dealing with—both the limited set of movies available in the original Movielens data and the quality of the Wikipedia plot data.

Still, when I saw Footloose, I was concerned that perhaps there might be a bug in my code. Digging a little deeper, I discovered that both movies did in fact share the highest score in a particular topic. However, the bulk of each movie’s score came from different words within that same topic. This means that the words within the topics of the LDA model aren’t always very related to each other—a rather serious fault, since grouping related words is exactly what the model is meant to accomplish.

The fact is, it’s difficult to gauge the overall quality of the topic model even by eyeballing a handful of results as I’ve done. This is because, like any clustering method, LDA is a form of unsupervised machine learning. That is to say, unlike a supervised machine learning method, there is no ground truth, or for-sure-we-know-it’s-right label, that we can use to objectively evaluate model performance.

However, what we can do is use the output from the topic model as input into the recommendation engine model (which is a supervised model). From there, we can see if the information gained from the topic model improves the performance of the recommendation engine. That was, in fact, my main motivation for using the topic model in the first place.

But before I get into that, I did want to share perhaps the most entertaining finding from this whole exercise (and the answer to the clickbait-y title of this blog post). The discovery occurred when I was comparing the bag of words and topic model results for The Bourne Ultimatum:

Similarity Rank: The Bourne Ultimatum


| Similarity Rank | Bag of Words | Topic Model |
| --- | --- | --- |
| 1 | Pelican Brief, The | Three Days of the Condor |
| 2 | Light of Day | Return of the Pink Panther, The |
| 3 | Safe Men | Ipcress File, The |
| 4 | JFK | Cop Land |
| 5 | Blood on the Sun | Sting, The |
| 6 | Three Days of the Condor | Great Muppet Caper, The |
| 7 | Shadow Conspiracy | From Here to Eternity |
| 8 | Universal Soldier | Man Who Knew Too Little, The |
| 9 | Universal Soldier: The Return | Face/Off |
| 10 | Mission: Impossible 2 | Third World Cop |

It wasn’t the difference in the quality of the two result sets that caught my eye. In fact, with The Great Muppet Caper in there, the quality of the topic model seems a bit suspect, if anything.

What interested me was the emphasis the topic model placed on the similarity of some older titles, like Three Days of the Condor or The Return of the Pink Panther. But it was the 1965 gem, The Ipcress File, that took the cake. Thanks to the LDA topic model, I now know this movie exists, showcasing Michael Caine in all his 60’s badass glory. That link goes to the full trailer. Do yourself a favor and watch the whole thing. Or at the very least, watch this part, coz it makes me lol. They def don’t make ’em like they used to…

Rev Your Recommendation Engines

To incorporate the topic data into the recommendation engine, I first took the top-rated movies from each user in the Movielens dataset and created a composite vector for each user based on the max of each topic across their top rated movies. In other words, I created a “profile” of sorts for each user that summarized their tastes based on the most extreme expressions of each topic across the movies they liked the most.
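
In code, the profile construction is just an element-wise max over topic vectors. A bare-bones sketch, where build_user_profile and movie_topic_vecs are made-up names and the vectors are assumed to come from the LDA model above:

```python
import numpy as np

N_TOPICS = 16

def build_user_profile(top_rated_movie_ids, movie_topic_vecs):
    """Summarize a user's taste as the element-wise max of topic weights
    across the movies they rated most highly."""
    vecs = [movie_topic_vecs[m] for m in top_rated_movie_ids if m in movie_topic_vecs]
    if not vecs:
        return np.zeros(N_TOPICS)
    return np.max(np.vstack(vecs), axis=0)
```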

After I had a profile for each user, I could get a similarity score for almost every movie/user pair in the Movielens dataset. Mixing these scores with the original Movielens ratings is a bit tricky, however, due to a wrinkle in the Spark recommendation engine implementation. When training a recommendation engine with Spark, one must choose between using either explicit or implicit ratings as inputs, but not both. The Movielens data is based on explicit ratings that users gave movies between 1 and 5. The similarity scores, by contrast, are signals I infer based on a user’s top-rated movies along with the independently trained topic model described above. In other words, the similarity scores are implicit data—not feedback that came directly from the user.

To combine the two sources of data, therefore, I had to convert the explicit data into implicit data. In the paper that explains Spark’s implicit recommendation algorithm, training examples for the implicit model are based off the confidence one has that a user likes a particular item rather than an explicit statement of preference. Given the original Movielens data, it makes sense to associate ratings of 4 or 5 with high confidence that a user liked a particular movie. One cannot, however, associate low ratings of 1, 2, or 3 with a negative preference, since in the implicit model, there is no notion of negative feedback. Instead, low ratings for a film correspond only to low confidence that a user liked that particular movie.
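
Putting the last two paragraphs together, here is a sketch of what the conversion and training step might look like with Spark’s ALS implementation. The confidence values and hyperparameters are illustrative placeholders rather than tuned settings:

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

# Convert explicit 1-5 star ratings into implicit confidence: high confidence
# for 4s and 5s, low (but never negative) confidence for everything else
implicit_df = explicit_df.withColumn(
    "confidence",
    F.when(F.col("rating") >= 4, 1.0).otherwise(0.1)
)

# Train the implicit-feedback variant of ALS on the converted data
als = ALS(userCol="userId", itemCol="movieId", ratingCol="confidence",
          implicitPrefs=True, rank=10, maxIter=10, regParam=0.1, alpha=40.0)
model = als.fit(implicit_df)
```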

Since we lose a fair amount of information in converting explicit data to implicit data, I wouldn’t expect the recommendation engine I am building to beat out the baseline Movielens model, seeing as explicit data is generally a superior basis upon which to train a recommendation engine. However, I am more interested in seeing whether a model that incorporates information about movie plots can beat a model that does not. Also, it’s worth noting that many if not most real-world recommendation engines don’t have the luxury of explicit data and must rely instead on less reliable implicit signals. So if anything, handicapping the Movielens data as I am doing makes the setting more realistic.

Results/Findings

So does the movie topic data add value to the recommendation engine? Answering this question proved technically challenging, due to the limitations of my old Macbook Air :sad:.

One potential benefit of incorporating movie topic data is that scores can be generated for any (user, movie) pair that’s combinatorially possible given the underlying data. If the topic information did in fact add value to the recommendation engine, then the model could train upon a much richer set of data, including examples not directly observed in real life. But as I mentioned, my efforts to explore the potential benefit of this expanded data slammed against the memory limits I was confined to on my 5-year-old Macbook.

My constrained resources provided a lovely opportunity to learn all about Java garbage collection in Spark, but my efforts to tune the memory management of my program proved futile. I became convinced that an un-tunable hard memory limit was the culprit when I saw executor after executor fail after maxing out its JVM heap while running a series of full garbage collections. The Spark tuning guide says that if “a full GC is invoked multiple times for before a task completes, it means that there isn’t enough memory available for executing tasks.” I seemed to find myself in exactly this situation.
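
For anyone curious, these are the kinds of knobs involved. The values below illustrate the style of tuning rather than a recipe that actually solved my problem:

```python
from pyspark import SparkConf, SparkContext

# Cap executor memory and trade cached-data space for execution space;
# GC logging is what made the repeated full GCs visible in the first place
conf = (SparkConf()
        .setAppName("movie-topics")
        .set("spark.executor.memory", "1g")
        .set("spark.memory.fraction", "0.6")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"))
sc = SparkContext(conf=conf)
```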

Since I couldn’t train on bigger data, I pretended I had less data instead. I trained two models. In one model, I pretended that I didn’t know anything about some of the ratings given to movies by users (in practice this meant setting a certain percentage of ratings to 0, since in the implicit model, 0 implies no confidence that a user prefers an item).  In a second model, I set these ratings to the similarity scores that came from the topic model.
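
A toy, plain-Python sketch of that masking step, with mask_ratings and similarity_lookup as made-up names for illustration:

```python
import random

def mask_ratings(ratings, frac, similarity_lookup=None):
    """Pretend we never saw a fraction of the (user, movie, confidence) triples:
    replace them with 0 (no confidence) or, if provided, the topic-model similarity score."""
    masked = []
    for user, movie, confidence in ratings:
        if random.random() < frac:
            fallback = 0.0
            if similarity_lookup is not None:
                fallback = similarity_lookup.get((user, movie), 0.0)
            masked.append((user, movie, fallback))
        else:
            masked.append((user, movie, confidence))
    return masked
```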

The results of this procedure were mixed. When I covered up 25% of the data, the two recommendation engines performed roughly the same. However, when I covered up 75% of the data, there was about a 3% bump in performance for the topic model-based recommendation engine.

Although there might be some benefit (and at worst no harm) to using the topic model data, what I’d really like to do is map out a learning curve for my recommendation engine. In machine learning, a learning curve charts algorithm performance as a function of the number of training samples used. Based on the two points I sampled, we cannot know for certain whether the benefit of including topic model data is always crowded out by the inclusion of more real-world samples. We also cannot know whether using expanded data based on combinatorially generated similarity scores improves engine performance.

Given my hardware limits and my commitment to using only the resources in my backpack, I couldn’t map out this learning curve more methodically. I also couldn’t explore how using a different number of topics in the LDA model affects performance—something else I was curious to explore. In the end, my findings are only suggestive.

While I couldn’t explore everything I wanted, I ultimately learned a butt-load about how Spark works, which was my goal for starting this project in the first place. And of course, there was The Ipcress File discovery. Oh what’s that? You didn’t care much for The Ipcress File?  You didn’t even watch the trailer? Well, then I have to ask you:

http://s2.quickmeme.com/img/fc/fca51cd9a8e2cadf3c61aa0b5f97f5aa6a9269c73d180215286f2676b6ef9862.jpg
