Once forked a process in Reno just to watch it die.

April 7, 2007

Awksomeness (part 2)

Filed under: awk, programming, statistics, R — Dave @ 8:39 pm

I expanded the program from earlier to handle some special cases in the data at Rate My Prof:

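BEGIN {
 # One field per line, so $1 is the whole input line.
 s=""; FS="\n";
 print ("last,first,department,votes,quality,ease");
}
# Table cell: strip tags and whitespace, then append it to the current row.
/<td/ {
 str = $1;
 gsub(/<[^>]*>/, "",  str);
 gsub(/[\t ]/, "", str);
 # Keep only non-empty cells shorter than 40 characters.
 if( length(str)<40 && length(str)>0 )s=(s str ",");
}
# Start of a new row: tidy up and print the row collected so far.
/<tr|<TR/ {
 sub(/,$/, "", s);
 gsub(/&nbsp;/, "0", s);   # blank (&nbsp;) cells become 0
 gsub(/,,/, ",", s);
 if(length(s)>0) print s;
 s=""
}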
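To see what the cell-cleaning step does on its own, here is a minimal sketch that runs the same two gsub() calls over a single line. The clean_cell wrapper and the sample markup are made up for illustration; the real Rate My Prof pages may look different.

```sh
# clean_cell: hypothetical helper that applies the two gsub() calls
# from the /<td/ rule to every line on stdin.
clean_cell() {
  awk '{
    str = $0
    gsub(/<[^>]*>/, "", str)   # drop anything that looks like an HTML tag
    gsub(/[\t ]/, "", str)     # drop tabs and spaces
    print str
  }'
}

# Made-up table cell, not taken from the real site.
printf '<td class="name">\t<a href="#">Smith</a> </td>\n' | clean_cell
# prints: Smith
```

Note that the second gsub() removes spaces inside a cell as well as around it, so a department like "Computer Science" would come out as "ComputerScience".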

In R:

> uw<-read.csv("c:/newsite/articles/ratemyprof/marksuw.txt")
> plot(uw$ease,uw$quality, xlim=c(1,5), ylim=c(1,5))

Quality vs Easiness

And the first result is that a professor’s quality and easiness aren’t strongly correlated.

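Actually, here’s a more honest graph, with a little random noise added to each point so that profs with identical scores don’t pile up on the same spot: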

> uw$quality2<-uw$quality+runif(length(uw$quality), min=-.05, max = .05)
> uw$ease2<-uw$ease+runif(length(uw$ease), min=-.05, max = .05)
> plot(uw$ease2,uw$quality2, xlim=c(1,5), ylim=c(1,5))

Quality vs Easiness 2

Looking at the distribution of “quality” marks:

Original Quality Distribution

The data isn’t normally distributed, not even close (the average is 3.4). And if a prof has only one vote, that vote skews their score far more than it should (a prof with 50 votes averaging 4.5 is probably better than a prof with a single 5). So I’m going to multiply each prof’s distance from the mean by the square root of their number of votes divided by 10:

> uw$qual<-mean(uw$quality)+((uw$quality-mean(uw$quality))*(uw$votes/10)^.5)
> hist(uw$qual,breaks=c(20))

Modified Quality Distribution

Much nicer. Except there’s still one prof, rated 2.3 by 198 voters, who gets slaughtered down to a -1.5. Maybe we don’t need to worry about a few edge conditions.
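As a rough check on the adjustment, here is a small awk sketch that applies the same formula to single profs. The adjust function is a hypothetical name, and the mean is taken as the rounded 3.4 quoted above rather than computed from the data, so the results are approximate.

```sh
# adjust MEAN QUALITY VOTES: hypothetical helper that rescales a prof's
# distance from the mean by sqrt(votes/10), as in the R line above.
adjust() {
  awk -v mean="$1" -v q="$2" -v votes="$3" \
    'BEGIN { printf "%.2f\n", mean + (q - mean) * sqrt(votes / 10) }'
}

adjust 3.4 2.3 198   # the prof rated 2.3 by 198 voters: prints -1.49
adjust 3.4 5.0 1     # a single vote of 5:               prints 3.91
adjust 3.4 4.5 50    # 50 votes averaging 4.5:           prints 5.86
```

With the rounded mean, the 50-vote prof now ranks above the single-vote 5, which was the point of the adjustment. The adjusted numbers are no longer confined to the 1 to 5 scale, which is how the 198-vote prof ends up below zero.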
