To extend this assignment, our group decided to cluster the same movies by keywords. We did this for a few reasons. First, we simply wanted to see which approach would work better. Second, we wanted to see whether the increased number of factors had a positive or negative effect on the resulting clusters. Lastly, we wanted to see how the increased number of factors affected the size and variance of the clusters.
The first thing I had to do was strip the movies of their data and store the movie names in a text file. Then, line by line, I compared each movie name in the names text file with each line of the keywords text file (an example of the file is shown below). If the movie name matched the name of the movie on that line, the program took the next word on the line, which was the keyword.
Example of keyword file:
*batteries not included (1987) fire-sprinkler
*batteries not included (1987) restaurant
*batteries not included (1987) flying
*batteries not included (1987) kids-and-family
*batteries not included (1987) title-spoken-by-character
10 (1979) songwriter
10 (1979) mexico
10 (1979) newlywed
10 (1979) beach
10 (1979) rescue
10 (1979) birthday-party
10 (1979) convertible
10 (1979) acapulco-mexico
The code that did this:
movies = open("names.txt")
names = movies.readlines()
movies.close()
key = open("keywords1.list")
keys = key.readlines()
key.close()
filename = "keylist4.dat"
output = open(filename, "w")
for line in names:
    # drop the trailing newline from the movie name
    length = len(line)
    word = line[0:length-1]
    # turn "Title, The" into "The Title" by cutting at the comma
    j = 0
    for letter in range(0, length-1):
        if(word[j] != ","):
            j=j+1
        else:
            word = "The "+word[0:j]
    # scan the whole keywords file once for this movie
    for line1 in keys:
        p = line1.strip().split('\t')[0:]
        name = p[0]
        num = len(name)
        # find the "(" that starts the year
        i = 0
        for letters in range(0,num):
            if(name[i]!="("):
                i = i+1
            else:
                length2 = i
        # title without the year and the space before it
        movie = name[0:length2-1]
        #print movie
        if(word == movie):
            output.write(line1)
            print(line1)
output.close()
This code took each movie name, removed the newline character, and made sure that the word "The" was in the correct spot. (I did the rest of the comma titles by hand, as there were not many; I just had to make sure the title was in the correct order, not "Punisher, The" but "The Punisher".) Next it got the name of the movie from the keywords file. That file had the date in parentheses after the movie name, so I had to remove it. Once it was removed I compared the movie titles, and if they matched I wrote the line to a file.
This process took 8 hours, because the keywords master file was over 2 million lines long and had to be read through once for every movie. At first I ran it normally, and judging by how long one movie took I calculated that the whole run would take somewhere around 24 hours. So I cut the movie file into three equal parts and ran the program as three separate instances, one per core, leaving one core for the operating system. Run like this, it finished in 8 hours.
Now that I had my text file with a column of names and a column of keywords, I needed to change the format to look like this:
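The two clean-up steps described above can be sketched as small helper functions. This is a minimal sketch, not the code that was run: the function names `fix_article`, `strip_year` and `title_matches` are hypothetical, and it assumes the year is the last parenthesised group in the name (it cuts at the last opening parenthesis, where the script above cuts at the first).

```python
def fix_article(title):
    """Move a trailing ', The' to the front: 'Punisher, The' -> 'The Punisher'."""
    suffix = ", The"
    if title.endswith(suffix):
        return "The " + title[:-len(suffix)]
    return title


def strip_year(name):
    """Drop a trailing '(1987)'-style year and the space before it."""
    cut = name.rfind("(")
    if cut == -1:
        return name  # no parenthesis, nothing to remove
    return name[:cut].rstrip()


def title_matches(list_title, keyword_name):
    """Compare a name from the movie list with a name from the keywords file."""
    return fix_article(list_title.strip()) == strip_year(keyword_name)
```

With these helpers, `strip_year("10 (1979)")` gives `"10"`, which is the form the movie list uses.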
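Cutting the movie list into equal parts for separate runs can be sketched as follows. This is an illustration only: the helper name `split_into_parts` is hypothetical, and the original split may well have been done by hand.

```python
def split_into_parts(lines, parts):
    """Split a list into `parts` consecutive chunks whose sizes differ by at most one."""
    size, extra = divmod(len(lines), parts)
    chunks = []
    start = 0
    for k in range(parts):
        # the first `extra` chunks take one additional line each
        end = start + size + (1 if k < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks
```

Each chunk would then be written to its own names file and given to a separate run of the matching script, so that each run covers roughly a third of the estimated 24 hours.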
Movie name,keyword,keyword,keyword,keyword,keyword
So I wrote the following program to do that:
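A minimal sketch of such a conversion, which is not the original program: it assumes each input line holds a movie name and a single keyword separated by a tab, and that all lines for one movie are adjacent. The function name `group_keywords` is hypothetical.

```python
def group_keywords(lines):
    """Turn 'name<TAB>keyword' lines into 'name,keyword,keyword,...' rows."""
    rows = []
    current_name = None
    current_keywords = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, _, keyword = line.partition("\t")
        if name != current_name:
            # a new movie starts: flush the previous one
            if current_name is not None:
                rows.append(",".join([current_name] + current_keywords))
            current_name = name
            current_keywords = []
        if keyword:
            current_keywords.append(keyword)
    # flush the last movie
    if current_name is not None:
        rows.append(",".join([current_name] + current_keywords))
    return rows
```

A title that itself contains a comma would add an extra column to its row, so such titles need to be reordered or quoted first.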
Now that the data was in a more manageable format, I needed to extract all the keywords from all the movies. I used this data file to look at each keyword one at a time; if it was not already in the keyword array, I added it, and then wrote these out to a file. Next I compared each movie to this list of keywords. For every keyword in the list, the program put a 1 in that column if the movie had the keyword and a 0 if it didn't. This created a binary file that I could cluster on. The code for this looked as follows:
key = open("keylist2.dat")
keys = key.readlines()
key.close()
output = open("temp.dat", "w")
word = ""
temp = ""
keylist = []
for line1 in keys:
    p = line1.strip().split('\t')
    colnames = p            # the first column is the movie name
    word = colnames[0]
    for column in p:
        if(temp == ""):
            temp = word
        elif(temp == word):
            # skip the movie name itself and any empty columns
            if (column != colnames[0]):
                if (column != ""):
                    # reset the flag for every keyword, not once per line
                    is_new = "true"
                    for item in keylist:
                        if(item == column):
                            is_new = "false"
                    if(is_new == "true"):
                        print(column)
                        keylist.append(column)
                        output.write(column)
                        output.write(',')
        else:
            temp = word
output.close()
The file it created would look like the output below, but with many more keywords (columns).
The code that then built the binary file:
key = open("temp.dat")
movie = open("movies2.txt")
movies = movie.readlines()
movie.close()
output = open("temp2.txt", "w")
keylist = []
# read the 12135 distinct keywords, one per line, dropping the newline
for i in range(0, 12135):
    temp = key.readline()
    length = len(temp)
    #print temp[0:length-1]
    keylist.append(temp[0:length-1])
key.close()
# write one row of 0/1 flags for each of the 1076 movies
for t in range(0, 1076):
    print(t)
    output.write('\n')
    colnames = movies[t].strip().split('\t')
    word = colnames[0]        # movie name (not written to the output)
    keywords = colnames[1:]   # this movie's keywords
    # one column per keyword in the master list, so every row has the same width
    for x in range(0, 12135):
        if(keylist[x] in keywords):
            output.write("1,")
        else:
            output.write("0,")
output.close()
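The two steps above, collecting the distinct keywords and then writing one 0/1 row per movie, can also be sketched on a toy input. This is a simplified illustration rather than the code that was run: the function names are hypothetical, and a set is swapped in for the list scan so that each membership check does not walk the whole keyword list.

```python
def build_vocabulary(movies):
    """Return the distinct keywords in first-seen order.

    `movies` is a list of (name, keywords) pairs.
    """
    vocabulary = []
    seen = set()
    for _name, keywords in movies:
        for keyword in keywords:
            if keyword and keyword not in seen:
                seen.add(keyword)
                vocabulary.append(keyword)
    return vocabulary


def encode(movies, vocabulary):
    """Return one list of 0/1 flags per movie, one flag per vocabulary keyword."""
    rows = []
    for _name, keywords in movies:
        present = set(keywords)
        rows.append([1 if keyword in present else 0 for keyword in vocabulary])
    return rows


# toy input using keywords from the example file above
movies = [
    ("10 (1979)", ["mexico", "beach"]),
    ("*batteries not included (1987)", ["flying", "restaurant"]),
]
vocabulary = build_vocabulary(movies)   # ['mexico', 'beach', 'flying', 'restaurant']
rows = encode(movies, vocabulary)       # [[1, 1, 0, 0], [0, 0, 1, 1]]
```

Every row has exactly as many columns as there are distinct keywords, which is what the clustering input needs.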
Now that both data sets are clustered, it's time to look at the differences. First, the genre data set had 8 columns, while the keywords data set had over 500. That is a very large difference in the amount of data available to the clustering. In general, it seemed that the more columns there were, the more accurate the clusters were. The only exception seemed to be when a movie had only one or two keywords; it might not be clustered as neatly as its neighbors.
One thing to note is the shape of the dendrograms. The genre picture is very straight and general, while the keywords picture is wavy and erratic. This shows the amount of variance between the two clusterings. To me, it says the keywords clustering was much more specific. The pictures are shown below:
Now to look at a few examples.
From Genre:
From Keywords:
When looking at these, the first thing you notice is that neither is really wrong. But it is easy to see that the keywords picture is more specific and much more accurate in matching movies. These particular pictures were centered around Scarface.
One more:
From Genre:
From Keywords:
This example shows the advantages of keywords even better than the last. Not only are the movies clustered much better but, as is evident from the shape of the clusters, the keywords clustering is much more specific. In fact, you almost never see two movies in a cluster together; for this to happen they would have to be almost identical in keywords, and the odds against this are fairly high.
To wrap things up, keywords are a better basis for clustering movies, for a few reasons. The large number of columns gives the clustering more to work with, which made the results more accurate. Genre has only 8 variables, leaving the chance of two movies being the same too high. Also, the specific nature of the keywords keeps each movie in close proximity to similar movies. All in all, keywords proved to be an excellent way to cluster movies: they gave a very large pool of words that gave the data set huge diversity while maintaining a common base of comparison.
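To make the point about identical keywords concrete: on 0/1 rows, two movies sit at distance zero only when every column agrees. A minimal sketch, assuming the rows are compared by counting differing columns (Hamming distance); the write-up does not say which distance the clustering tool used, so this choice and the function name `hamming` are assumptions.

```python
def hamming(row_a, row_b):
    """Count the columns in which two equal-length 0/1 rows differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of columns")
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


same = hamming([1, 0, 1, 0], [1, 0, 1, 0])       # 0: identical keyword sets
different = hamming([1, 0, 1, 0], [1, 1, 0, 0])  # 2: two columns disagree
```

With hundreds of keyword columns, a single differing keyword is enough to give a non-zero distance, which is consistent with movies rarely merging at the very bottom of the keywords picture.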
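The point about genre can be checked with a little arithmetic. A sketch, assuming each of the 8 genre columns is a 0/1 flag (the write-up does not state this) and taking 1076 as the number of movies from the loop bound in the script above; the function names are hypothetical.

```python
def distinct_patterns(columns):
    """Number of different 0/1 rows possible with `columns` binary columns."""
    return 2 ** columns


def min_largest_group(items, patterns):
    """Pigeonhole bound: some pattern must be shared by at least this many items."""
    return -(-items // patterns)  # ceiling division


genre_patterns = distinct_patterns(8)                   # 256
shared = min_largest_group(1076, genre_patterns)        # 5
```

Under those assumptions there are only 256 possible genre rows for 1076 movies, so at least one genre row must be shared by 5 or more movies, while the keyword rows have far more room to differ.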








