Awksomeness (part 2)
I expanded the program from earlier to include special cases related to the data at Rate My Prof:
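BEGIN {
    s = ""; FS = "\n";    # one field per line, so $1 is the whole line
    print ("last,first,department,votes,quality,ease");
}
# Table cell: strip tags and whitespace, then append to the current row
/<td/ {
    str = $1;
    gsub(/<[^>]*>/, "", str);
    gsub(/[\t ]/, "", str);
    if (length(str) < 40 && length(str) > 0) s = (s str ",");
}
# Start of a row: tidy up and print the row collected so far
/<tr|<TR/ {
    sub(/,$/, "", s);
    gsub(/ /, "0", s);
    gsub(/,,/, ",", s);
    if (length(s) > 0) print s;
    s = ""
}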
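To see what the program does with a table, here is a small sketch that feeds it a few made-up rows. The sample names and numbers are invented for illustration (they are not real Rate My Prof data), and it assumes the page puts one tag per line. Note that a row is only printed when the next <tr shows up, so the sample ends with an extra <tr>.

```sh
#!/bin/sh
# Made-up sample rows (not real Rate My Prof data), one tag per line.
# The trailing <tr> matters: a row is flushed only when the next <tr appears.
sample='<tr>
<td>Smith</td>
<td>Jane</td>
<td>Physics</td>
<td> 12 </td>
<td>4.5</td>
<td>3.0</td>
</tr>
<tr>
<td>Doe</td>
<td>John</td>
<td>Math</td>
<td>1</td>
<td>5.0</td>
<td>2.5</td>
</tr>
<tr>'

# Same awk program as above, passed inline.
out=$(printf '%s\n' "$sample" | awk '
BEGIN { s = ""; FS = "\n"; print ("last,first,department,votes,quality,ease") }
/<td/ {
    str = $1
    gsub(/<[^>]*>/, "", str)
    gsub(/[\t ]/, "", str)
    if (length(str) < 40 && length(str) > 0) s = (s str ",")
}
/<tr|<TR/ {
    sub(/,$/, "", s)
    gsub(/ /, "0", s)
    gsub(/,,/, ",", s)
    if (length(s) > 0) print s
    s = ""
}')

printf '%s\n' "$out"
# prints:
# last,first,department,votes,quality,ease
# Smith,Jane,Physics,12,4.5,3.0
# Doe,John,Math,1,5.0,2.5
```

The result is a CSV with a header line, which is the shape read.csv expects.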
In R:
> uw<-read.csv("c:/newsite/articles/ratemyprof/marksuw.txt")
> plot(uw$ease,uw$quality, xlim=c(1,5), ylim=c(1,5))
And the first result is that a professor’s quality and easiness aren’t strongly correlated.
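Actually, here’s a more honest graph, with a little random noise added to each point so that profs with identical scores don’t hide behind each other: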
> uw$quality2<-uw$quality+runif(length(uw$quality), min=-.05, max = .05)
> uw$ease2<-uw$ease+runif(length(uw$ease), min=-.05, max = .05)
> plot(uw$ease2,uw$quality2, xlim=c(1,5), ylim=c(1,5))
Looking at the distribution of “quality” marks:
The data isn’t normally distributed, not even close (the average is 3.4), and if a prof has only one vote then that vote skews their rating far more than it should (a prof with 50 votes averaging 4.5 is probably better than a prof with a single 5). So I’m going to multiply each rating’s distance from the mean by the square root of a tenth of the number of votes:
> uw$qual<-mean(uw$quality)+((uw$quality-mean(uw$quality))*(uw$votes/10)^.5)
> hist(uw$qual, breaks=20)
Much nicer. Except there’s still one prof, originally rated 2.3 but by 198 voters, who gets slaughtered down to -1.5. Maybe we don’t need to worry about a few edge cases.
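Written out as a formula, the adjustment in the R line above looks like this. The symbols are my own shorthand, not anything from the data: q is a prof's original quality score, n the number of votes, and q-bar the mean quality, taken as roughly 3.4 from the rounded average quoted earlier, so the results are approximate.

```latex
\[
q' = \bar{q} + (q - \bar{q})\sqrt{\frac{n}{10}}
\]
% The edge case: rated 2.3 by 198 voters
\[
q = 2.3,\ n = 198:\quad
q' \approx 3.4 + (2.3 - 3.4)\sqrt{19.8} \approx 3.4 - 1.1 \times 4.45 \approx -1.5
\]
% The two profs compared earlier: 50 votes averaging 4.5 versus a single 5
\[
q = 4.5,\ n = 50:\quad q' \approx 3.4 + 1.1\sqrt{5} \approx 5.9
\qquad
q = 5,\ n = 1:\quad q' \approx 3.4 + 1.6\sqrt{0.1} \approx 3.9
\]
```

So the prof with 50 votes now lands above the prof with a single 5, as intended. Any prof with more than 10 votes gets pushed away from the mean, which is why adjusted scores can fall outside the 1 to 5 range.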





