Thursday, April 30, 2009

Portfolio 9

For our final project our team decided to expand on the clustering assignment done earlier in the semester. For the first assignment we extracted the top 1000 movies out of the IMDB database and clustered them by genre. This, as noted in a previous post, worked fairly well.
To extend this assignment our group decided to cluster the same movies by keywords. We did this for a few reasons: first, we simply wanted to see which would work better, we also wanted to see if the increased number of factors clustered by gave a positive or negative effect on the resulting cluster and lastly we wanted to see how the increased number of factors effected the size and variance of the clusters.

The first thing I had to do was strip the movies of their data and store the movie names in a text file. Then line by line I had to compare the movie name in the name text file with each line of the keywords text file (example of files shown below). If the movie name matched the name of the movie on that line, it would then get the next word on the line which was the keyword.

Example of keyword file:

batteries not included (1987) fire-sprinkler

*batteries not included (1987) restaurant

*batteries not included (1987) flying

*batteries not included (1987) kids-and-family

*batteries not included (1987) title-spoken-by-character

10 (1979) songwriter

10 (1979) mexico

10 (1979) newlywed

10 (1979) beach

10 (1979) rescue

10 (1979) birthday-party

10 (1979) convertible

10 (1979) acapulco-mexico


The code that did this:

movies = open("names.txt")
names = movies.readlines()
movies.close()
key = open("keywords1.list")
keys = key.readlines()
key.close

filename = "keylist4.dat"
output = open(filename, "w")

for line in names:
length = len(line)
word = line[0:length-1]
j = 0
for letter in range(0, length-1):
if(word[j] != ","):
j=j+1
else:
word = "The "+word[0:j]
for line1 in keys:
p = line1.strip().split('\t')[0:]
name = p[0]
num = len(name)
i = 0
for letters in range(0,num):
if(name[i]!="("):
i = i+1
else:
length2 = i

movie = name[0:length2-1]
#print movie
if(word == movie):
output.write(line1)
print line1


output.close()

This code took the word, got rid of the newline char, made sure that the word "the" was in the correct spot ( i did the rest of the comma words by hand as there were not many, I just had to make the the title was in the correct order, not like "Punisher, The" by like "The Punisher"). Next it got the name of the movie, the file with the keywords had the date in parenthases after the movie, so I had to remove that, once removed I would compare the movie title, if it matched, I grabbed the line and wrote it to a a file.

This process took 8 hours. This is because the keywords master file was over 2 million lines long and it had to be run through as many times as there were movies. First I tried running it normally and judging by how long it was taking to do one movie I calculated it was going to take somewhere around 24 hours. So I cut the movie file into 3 equal parts and ran the program in three separate instances, one on each core leaving one for the operating system. Like this it finished in 8 hours.

Now that I have my text file with a column of name and a column of keywords I needed to change the format to by like this:

Movie name,keyword,keyword,keyword,keyword,keyword

So I wrote the following program to do that:

key = open("keylist2.dat")

keys = key.readlines()

key.close

output = open("temp.dat", "w")

word = ""

temp = ""

keylist = []

for line1 in keys:

bool = "true"

p=line1.strip().split('\t')

word = colnames[0]

for column in p:

if(temp == ""):

temp = word

elif(temp == word):

if (column != colnames[0]):

if (column != ""):

l = column

length = len(l)

for item in keylist:

if(item == column[0:length]):

bool = "false"

if(bool == "true"):

print column

keylist.append(column)

output.write(column[0:length])

output.write(',')

else:

temp = word

Now that the data was in a more manageable format I needed to extract all the keywords from all the movie, I used this data file to look at each keyword one at a time, if it was not already added to the keyword array, I added it, and then wrote these out to a file. Next I compared each movie to this list of keywords. I would look at the list of all the keywords, and if the movie had that keyword it would put a 1 in the column, and if it didn't it would put a zero. This created a binary file that I could cluster on. The code for this looked as follows:

key = open("temp.dat")

movie = open("movies2.txt")

movies = movie.readlines()

movie.close()

output = open("temp2.txt", "w")

word = ""

temp = ""

keylist = []

for i in range(0, 12135):

temp = key.readline()

length = len(temp)

#print temp[0:length-1]

keylist.append(temp[0:length-1])

for t in range(0, 1076):

print t

output.write('\n')

colnames=movies[t].strip().split('\t')[0:]

p=movies[t].strip().split('\t')

word = colnames[0]

for column in p:

for x in range(0, 12135):

if(column == keylist[x]):

output.write("1,")

else:

output.write("0,")

key.close()

The file it created would look like the output below, but with many more keywords (columns)


Now that both data sets are clustered its time to look at the differences. First, the genre data set had 8 columns, the keywords data set had over 500. That is a very large difference in data pools to conduct the clustering. In general it seemed the more columns that were there the more accurate the clusters were. The only exception to this seemed to be when the movie had one or two keywords, it may not be clustered as neatly as its neighbors.

One thing to note is the share of the dendograhms. The genre picture is very strait and general, while the keywords file is wavy and erratic. This shows the amount of variance between the 2 clusters. To me is says the keywords cluster was much more specific. The pictures are shown below:


Now to look a a few examples
From Genre:

From Keywords:

When looking at this the first thing you notice is that neither is really wrong. But it is easy to see that the keywords picture is more specific, and much more accurate in matching movies. These particular pictures were centered around Scarface.

One more:
From Genre:

From Keywords:

This example show the advantages of keywords even better than the last. Not only are the movies clustered much better, but, as evident by the shape of the clusters, the keywords cluster is much more specific. In fact you almost never see any 2 movie in a cluster together, for this to happen they would have to be almost identical in keywords, and the odds against this are fairly high.

To wrap things up, keywords are a better substance to cluster movies by for a few reasons. The abundant amount of columns gives it more to cluster by and is therefore more accurate. Genre only has 8 variables, leaving the chance of 2 movies being the same too high. Also the specific nature of the keywords keep them in close proximity to like movies. All in all, keywords proved to be an excellent way to cluster movies, it gave a very large pool of words that gave the data set huge diversity while maintaining a common base of comparison.

Saturday, April 25, 2009

Portfolio 8

Our group presented on optimization for our group assignment.

Optimization is often used to find the best solution to a complicated problem. This is often done by running multiple scenarios that a slightly different than each other to find a good solution. This is done becuase often there are too many solutions to try them all, so an algorithm is used to try and modify the solution slightly each time, and if it gets better, that modification is kept, if not it is discarded. This will find the best possible solution given the problem of there being too many solutions to try them all.

One method of doing this is called random searching. This is done by generating random guesses for each possible solutions. The best guess are kept and used as a base of comparison for future guesses. The advantage of this algorithm is that it is very simple.

A variation of this method is holl climbing, this is when a random value is picked at a starting point, and is looks at the neighbours to find the next best match. One problem with this method is that it can get stuck in a local minimum which is isolated by locally high (or good guesses) values that are preventing it from reaching a global maximum. This can be though of as a valley with 2 peaks, the algorith gets it to the top of the highest of the 2 peaks and decides that it has found the best match, but in reality there is another peak that is higher somewhere else in the dataset.

Another variation is Simulated Annealing, it is much like hill climbing, but it makes multiple random guesses in multiple areas of the data, giving it a better change of finding the global best guess.

The next method is genetic algorithms, these work by making an initial set of random solutions, the best ones are kept, and the rest are mutated by mixing them randomely, if one is better it takes its place as the best. This solution is very effective becuase it has to be run less times, and the random mutations ensures vastly different guesses at each iteration. It is generally considered a very good method to find the best solution to a problem that has too many solutions to try them all.

One thing to remember when doing optimization is that you never find the actual best solution, or if you did you have no way of knowing it, what you are finding is usually very close to the best solution though. Also, not ever problem is solvable, it has to meet some minimum requirements. It has to have a predefined cost function, and each solution must be similar. Even with these requirements met, the problem may not be solvable.

One last thing to keep in mind is displaying the data. As with all data the visualization has to be a healthy mix of human understandable format, and enough detail to still be relevant. The algorithms described before do a good job of finding the solution, but displaying it can be a whole other issue. It has to be human readable and still retain its relevance. Below are 2 example of the same data:

Bad


Good


These 2 pictures show the exact same data, but the second is much more human readable. This because the lines are not crossed. Keeping lines from crossing is the most common way of making data easy to read. This is done by counting lines, keeping track of lengths and positions and using an algorithm to make sure that none of these lines are crossing using this data. Ironically enough, this is very often done using a genetic algorithm. Also, to keep the data from being oddly distanced, min and max line lengths are often declared. This keeps thedata easy to read and it also keeps it auto genereated, which is important when dealing with very large data sets.

Portfolio 7

The first thing I did was read through the chapter to get an idea of what it was about. I then got the code all input and ready to go in my IDLE session.

When I went to run it I got a message telling me that I did not have pysqlite. So being in ubuntu I opened terminal and typed "sudo apt-get install python-pysqlite2" which took care of that problem.

The next time I went to run it I got a message telling me I did not have beautifulsoup, so once again I opened up terminal and typed "sudo apt-get install python-beautifulsoup" which took care of that too.

On a side note, using apt-get to organize my lib modules in python is a much easier way to install and remove modules.

Now that all the required modules are installed I can finnaly run the program, so I input the prompts as the book said to and this is what happened:

>>> import searchengine
>>> pagelist=['http://kiwitobes.com/wiki/Perl.html']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)
Could not parse page http://kiwitobes.com/wiki/Perl.html
So obviosly that page will not do. I decided to first check the errata for another page. Nobody there mentioned having a problem, so it must be a relatively recent event, so I tried another site.

I tried a few other websites, but it still did not work. One difference is that istead of saying "could not parse" it said "could not open." I even tried other pages on the same domain, still the same result.


Even though it did not work, I can see by looking at the code and reading through the chapter what it was supposed to do. It goes to the page and runs the indexer on every link found on that page and so on until it runs out of links. It looks as though it would be done recursively. This parsing of all pages indexes them much the same way a search engine would.

Wednesday, April 8, 2009

Portfolio 6

For this assignments I had to cluster a large data set of movies in any way I saw fit. I took the data from IMBDs database of movies and decided after doing some tests to cluster by genre. I did this for 2 reasons, the binary true or false that determined if a movie fit a genre would allow movies to be part of multiple genres, subsequently providing good clustering results (hopefully), and it would also give the user something they can easily look at and determine if the clustering worked or not.

The first thing I did was filter the movies, the data set had thousands of movies in the list and I only wanted about 2000. But I did not want any 2000. I wanted the 2000 most popular movies so that I could look and see if these clusters make any sense once I got that far. So I ran a filter on the text file that took the movies with the most votes (I assumed that more votes = more popular, or at the very least more well known, which achieves the same result as far as I was concerned). In this case it was the movies with over 10,000 votes. The resulting data set was only a couple dozen over 2000, my desired goal.

The filter program is shown here:
def filterit2(self, ifilename):
# this will go through the movies.tab file and only save US movies
# in addition it will add to the dictionary the genre values
self.movieMatrix = {}
ofile = open("US22.txt", 'w')
logfile = open("log.txt", 'w')
ifile = open(ifilename)
line = ifile.readline()
# remove header
line = ifile.readline()
while line != '':
components = line.split('\t')
if len(components) > 23:
title = components[0] + "//" + components[1]

votes = int(components[5])
if votes > 10000 and title in self.movieList:
values = ""
for i in range(17, 24):
values = values + components[i].strip() + '\t'
logfile.write(title + "\n")
ofile.write(title + '\t' + values + '\n')
self.movieMatrix[title] = values
line = ifile.readline()
ofile.close()
logfile.close()
ifile.close()

Now that I had my data set it was time to trim the fat. The first thing I did was take the data into excel and name the columns. Then I deleted all but the genres and the movie names. Then as per usual I copied the data out of excel and pasted it into notepadd++ and saved it as a text file.

I ran the clustering algorithm on the data and at first I was convinced that it was not functioning. It almost immediately froze python and did nothing. After trying my luck with some debugging I realized that it was not frozen, despite the "not responding" message I was getting. Python was contusing to use more and more ram, until about a half an hour had passes and they began to free up the memory from python. About 15 minutes after the memory began to free up it finished.

I think it is important to note here that it took an entire 15 minutes. This is a combination of the number of movies being clustered and the amount of attributes it is being clustered by. When I did a data set of 1000 with the same attributes it took only 5 minutes, leading my to believe the work load on clustering has a very high curve or is exponential.

After it finished clustering I printed my results and to my surprise they were almost always very accurate. When looking at the data I noticed that while there were the occasional anomalies, most of the data ended up in pretty specific categories, not broad like "action" or "romance" but very specific like "Comedy, action, law enforcement movies" such as the example posted below:

Last Action Hero
-
-
Last Boy Scout, The
Lethal Weapon
-
Lethal Weapon 2
Lethal Weapon 3
-
-
-
Pirates of the Caribbean: The Curse of the Black Pearl
Psycho
-
Rush Hour
Rush Hour 2
-
-
Shanghai Noon
-
Starsky & Hutch
Team America: World Police

There is a movie in this example that does not fit completely, pirates of the Caribbean, its action and comedy. But not law enforcement, at least not predominantly in the plot. But the rest of the movies fit very accurately, even matching movies with their sequels, with nothing to go on but genre. This result is duplicated over and over again in my output.

Also the movies, showing clustering by their tabbing, are more closely related to their closer tabbed neighbors than the ones that are farther away

Sunday, March 1, 2009

Portfolio 5

This portfolio was relatively easy since most of what we had to do I had already done for the previous assignment.

The first thing I did was search the web for cool visualizations. I noticed when looking around that there are a lot of different ways to look at data.

When looking at the different visualizations what I learned is that just because something is visual does not necessarily mean that it is accurate or better. On the other side of the coin, just because a data set contains a lot of data does not mean that it is more precise or accurate than a data set that contains less data.

For very flashy visualizations, this can be hit or miss. Some flashy visualizations do a great job of taking complex data and turning it into something that the average person can see and understand. This seems to normally be accomplished through comparisons with other parts of the same data set, whether it be by size of bubbles, bars or lines.

The following link shows a bad visualization.
https://agora.cs.illinois.edu/download/attachments/17831/badvis3.jpg
This is a bad visualization because there is no grid or labels or any markers of any kind. Even if a visualization is accompanied with text or speech, it should always have the scale and axis labeled.

A good visualization balances the visual and the data. It does not try to distort the data or decieve the viewer, it strives to show, in an easy to read manner, the data as best it can. This involves showing all the data, yet keeping concise enough to read and understand quickly.

A good visualization can be found here:
http://www.time.com/time/covers/20061030/where_we_live/

this map shows population levels by area with an easy to see spire. The spires can be compared very quickly to give the viewer a scale with witch to make assumptions from the data.



The data set I made is from the US census. I found data for population projections based on age in the US. It has subdiveded age into different catagories and showes the rise and fall for each age group as projected in the different time periods. I did have to take it into excell to trim out overlapping data, giving it the clear cut age catagories seen in the data.

It can be found here:
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/population-projections-by-the-us-cen

Saturday, February 21, 2009

Portfolio 4

First I read through the chapter. I ran the sample data to get a feel for the software and to familiarize myself. I am not going to go into much detail on that as it was simply plug and play.

I then went out on to the web to find a data set of my own to run through the code. I found a set of data on the amount of computer usage of 15 year olds around the world. It can be seen Here.

Its not the only data set I ran, but I am choosing to blog about this one because its small enough to easily read quickly (unlike the very large sets I played with previously) and because I found it interesting.

Getting the data set ready took a little time to figure out. First I would copy the text off the website, then I would paste it into notepad++. I did this because it keeps the tabbing, which excel will not do if you paste it directly from the internet. Then I simply copied it out of notepad++ and pasted it into excell, just to make sure the columns were correct. This was also a good place to edit the row and columns if nessesary. Then I simply pasted it into a .txt file and saved it.

Next I ran the clustering code. It did a good job clustering the data, I felt like it backed up the assumptions I had made when I looked at it. The output is below:

>>> clusters.printclust(clust,labels=row)
-
-
Mexico
Hungary
-
-
-
-
United States
Belgium
-
Denmark
Korea
-
-
Australia
Sweden
-
Iceland
Canada
-
-
-
-
Japan
Russian Federation
-
Turkey
Greece
-
-
Poland
Ireland
-
Slovak Republic
Czech Republic
-
-
-
OECD average
Italy
-
Portugal
Finland
-
-
New Zealand
Austria
-
Switzerland
Germany
And here is the Dendogram:


Click to make larger

When looking at both, the first thing I noticed is that the text output has more detailed clusters, or so it seems. The countries appear to be grouped together according to how much they use the computer, as well as where they use it.
This is the visual of clustering the columns of data. The clustering of Rows had 2 categories: at home and at school.
It put Hungry and Mexico into their own cluster because they are the only 2 countries to use the computer more at school than at home. I was hoping to see more clusters, but I do understand why it did it this way. This is also the drawback of a smaller easier to read data set. The larger data sets I have run had more clusters, but were very hard to read quickly.

I ran another data set, which was a little more complicated to illustrate the process again.

This data set compares beer consumption in different states by looking at the factors: Wine consumption, education level, temperature and Obesity Rate for each state. The clustering program did an excellent job of creating clusters of states by which variable seemed to effect beer consumption the most. the data set can be found here, and the output it below:
>>> clusters.printclust(clust,labels=row)
-
-
-
-
Connecticut
-
New Jersey
-
Maryland
New York
-
-
Rhode Island
-
-
Illinois
Oregon
-
Washington
-
Kansas
Virginia
-
California
Utah
-
Massachusetts
-
Minnesota
-
Colorado
Vermont
-
-
-
-
Maine
-
Michigan
Pennsylvania
-
-
Arizona
Delaware
-
New Mexico
-
Idaho
-
Iowa
Nebraska
-
-
Florida
-
Oklahoma
-
Georgia
North Carolina
-
West Virginia
-
-
-
Alabama
Arkansas
-
Mississippi
-
Indiana
Tennessee
-
-
Louisiana
-
South Carolina
Texas
-
Kentucky
-
Missouri
Ohio
-
Nevada
-
-
Wyoming
-
South Dakota
Wisconsin
-
North Dakota
-
Montana
New Hampshire

click to make larger.

It is easy to see that this data set had more variables determining the level of beer consumption, this led to more clusters. The computer use data set had less clusters because it only told you the end result, it did not give any variables that may have effected computer use.

Many-Eyes

Next I created a data set to visualize at many-eyes.com. I went to w3schools.com and got the statistics of users visiting their website. The statistic I decided to use is OS usage. I copied the data and took it into excel to clean it up. Then I uploaded it to many-eyes. I then visualized a stack graph so you could use see the OS usage over time all together, as well as look at them individually over time. The visualization is found here.

Thursday, February 12, 2009

Portfolio 3

I had a lot of fun with this particular assignment. Our group first met last sunday to get started on this. We decided to do it as a group since none of us knew really what to expect. We download the python library and installed it into the Lib directory of python.

The first thing we tried to do was make an artist object, which is where we ran into our first problem. The object required 4 inputs, the bandname in the form of a string, an api-key, a secret key and a session key. This is what took us a while to figure out. Since there was no documentation at all, we had no idea where to get these things. But after a while of surfing through last.fm's website in the api section we came across 2 long strings that resembled hax that claimed to be what we needed. So we plugged them into the api-key and the secret key. That still left the session key, which we set to null. Our artist object was created. This was the only real pain we encountered. We assembled a quick program where our artist object could get similar artists. At that point we went our seperate ways to do the rest on our own. I refined my python project a little bit by adding a similar tracks feature. The entire code for my python program is below:
import pylast
key = 'b8a9a83c0e60d30d237eea0dcdcf055a'
secret = 'a0a206e56d580a78e8715155106371fa'
sk = ''
bandName=input('Please enter band name:')

artist = pylast.Artist(bandName, key, secret, sk)
track = pylast.Track(bandName,"45", key, secret, sk)
similar = artist.get_similar()
track2 = track.get_similar()
me = pylast.User("trumpetporvida", key, secret, sk)
me.compare_with_user(


print similar



tracks = artist.get_top_tracks()

print "\n"

print tracks

print "\n"

print track2


It works well for what it does. I was talking to Dylan about how we had finished when he mentioned he wanted to do it in c# on visual studio. This peaked my interest, I had use visual studio c++, visual studio web developement and visual basic, but never c#. Since I knew how to access last.fm (having the keys now), and since I had not gotten very far with my python version, I thought making this project with c# would be a great way to learn the language.

After fiddling with the forms for a while and finding where everything was I got started. The forms were easy to interact with, using dot notation to access thier methods was very intuitive and setting up the GUI took only minutes. So started to add code to my project. I am not going to past my entire program into this blog becuase of its length, and also becuase almost every fuction is a variation of a common theme, but I will post the function that get similar artists and similar tracks.

Similar artists:
if (textBox1.Text != "")
{
if (comboBox1.Text == "Similar Artists")
{
webBrowser1.Visible = false;
pictureBox2.Visible = false;
richTextBox1.Clear();
string temp = textBox1.Text;

Lastfm.Services.Artist tmp = new Lastfm.Services.Artist(temp, session);
foreach (Lastfm.Services.Artist a in tmp.GetSimilar(50))
{
richTextBox1.AppendText(a.ToString() + "\n");
}
}
}

Ignoring the parts of the code that deal with the interface, you can see that the code simply creates an artist type and calles getSimilar(). Then it prints them all in a for each loop (a cool loop I had never seen before, Dylan showed it to me).

Similar Tracks:

webBrowser1.Visible = false;
pictureBox2.Visible = false;
if ((textBox1.Text != "") && (textBox2.Text != ""))
{
richTextBox1.Clear();
string temp = textBox1.Text;
string temp2 = textBox2.Text;
Lastfm.Services.Artist artist = new Lastfm.Services.Artist(temp, session);

Lastfm.Services.Track track = new Lastfm.Services.Track(artist, temp2, session);

foreach (Lastfm.Services.Track a in track.GetSimilar())
{
richTextBox1.AppendText(a.ToString() + "\n");
}
}
Not much different than before, but this time you can see that to make a Track type I had to make an artist type. Then using Track I could call getSimilar() and print the results.

While every function had its fair share of differences they all followed the theme of: Make object, call member function, display results to gui.

A screenshot of my GUI is seen below:


(Click for larger Image)


The dropdown box next to the artist input holds all the functions related to artists, there were too many to fit individually.

My program can do the following:
  • Display similar artists
  • Display top tracks of an artist
  • Display top albums of an artist
  • display the artist bio web page
  • Show similar tracks to those the user enters
  • get the album photo of an artist and track and display it
  • search common tags
  • search artists
  • display the users with the most in common with a user name untered
  • calculate a similarity score of 2 users
I really did enjoy this assignment, and I will probably use this program for myself in the future.
If anyone would like it, I can export it into an executable that installs it on your start menu.