April 7, 2007

Awksomeness (part 2)

Filed under: R,awk,programming,statistics — Dave @ 8:39 pm

I expanded the program from earlier to include special cases related to the data at Rate My Prof:

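BEGIN {
 s = ""; FS = "\n";   # one field per line, so $1 is the whole line
 print "last,first,department,votes,quality,ease";
}
# Table cell: strip tags and whitespace, then append the text to the current row.
/<td/ {
 str = $1;
 gsub(/<[^>]*>/, "", str);
 gsub(/[\t ]/, "", str);
 # Keep only non-empty cells shorter than 40 characters.
 if (length(str) < 40 && length(str) > 0) s = (s str ",");
}
# Start of a new row: tidy up and print the row collected so far.
/<tr|<TR/ {
 sub(/,$/, "", s);         # drop the trailing comma
 gsub(/&nbsp;/, "0", s);   # a blank cell becomes 0
 gsub(/,,/, ",", s);       # collapse doubled commas
 if (length(s) > 0) print s;
 s = ""
}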
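
To see what the program does to a single table row, here is a small sketch run against made-up markup. The sample HTML (one cell per line, a name cell of the form "Last, First") is an assumption for illustration; the real Rate My Prof pages may be laid out differently. The awk rules are the same as above, just condensed.

```sh
# Hypothetical markup, one cell per line; the real pages may differ.
sample='<tr>
<td>Smith, Jane</td>
<td>Physics</td>
<td>12</td>
<td>4.1</td>
<td>&nbsp;</td>
</tr>
<tr>'

# Same rules as the program above, condensed onto fewer lines.
out=$(printf '%s\n' "$sample" | awk '
BEGIN { s = ""; FS = "\n"; print "last,first,department,votes,quality,ease" }
/<td/ {
  str = $1
  gsub(/<[^>]*>/, "", str); gsub(/[\t ]/, "", str)
  if (length(str) < 40 && length(str) > 0) s = (s str ",")
}
/<tr|<TR/ {
  sub(/,$/, "", s); gsub(/&nbsp;/, "0", s); gsub(/,,/, ",", s)
  if (length(s) > 0) print s
  s = ""
}')
printf '%s\n' "$out"
```

This prints the header followed by `Smith,Jane,Physics,12,4.1,0`. Note that a row is only printed once the next `<tr` shows up, which is why the sample ends with an extra `<tr>`; a final row with nothing after it would stay in `s` and never be printed.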

In R:

> uw<-read.csv("c:/newsite/articles/ratemyprof/marksuw.txt")
> plot(uw$ease,uw$quality, xlim=c(1,5), ylim=c(1,5))

Quality vs Easiness

And the first result is that a professor’s quality and easiness aren’t strongly correlated.
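
The scatterplot only gives an eyeball impression. As a rough numeric check, Pearson's r could be computed straight from the CSV, sketched here in awk rather than R. The rows below are invented purely to exercise the arithmetic and say nothing about the real ratings; the column positions (quality in column 5, ease in column 6) follow the header the awk program prints.

```sh
# Made-up rows in the same column layout as the CSV header; not real ratings.
csv='last,first,department,votes,quality,ease
A,A,X,10,3,1
B,B,X,10,5,2
C,C,X,10,2,3
D,D,X,10,4,4
E,E,X,10,3,5'

r=$(printf '%s\n' "$csv" | awk -F, '
NR > 1 {                      # skip the header line
  x = $6; y = $5              # x = ease, y = quality
  n++; sx += x; sy += y
  sxx += x * x; syy += y * y; sxy += x * y
}
END {
  cov = sxy - sx * sy / n     # n times the covariance
  vx  = sxx - sx * sx / n     # n times the variance of ease
  vy  = syy - sy * sy / n     # n times the variance of quality
  printf "%.2f", cov / sqrt(vx * vy)
}')
echo "$r"
```

For these toy rows the result is -0.14. Run against the real file, a value near zero would back up the impression from the plot.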

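Actually, here’s a more honest graph, with a little random jitter added so that profs with identical ratings don’t pile up on a single point: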

> uw$quality2<-uw$quality+runif(length(uw$quality), min=-.05, max = .05)
> uw$ease2<-uw$ease+runif(length(uw$ease), min=-.05, max = .05)
> plot(uw$ease2,uw$quality2, xlim=c(1,5), ylim=c(1,5))

Quality vs Easiness 2

Looking at the distribution of “quality” marks:

Original Quality Distribution

The data isn’t normally distributed, not even close (the average is 3.4), and if a prof has only one vote then that vote skews their rating far more than it should (a prof with 50 votes averaging 4.5 is probably better than a prof with a single 5). So I’m going to multiply the distance from the mean by the square root of a tenth of the number of votes:

> uw$qual<-mean(uw$quality)+((uw$quality-mean(uw$quality))*(uw$votes/10)^.5)
> hist(uw$qual,breaks=20)
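
Written out as a formula, the adjustment looks like the sketch below. The symbols are my own shorthand, not anything from the data: q is a prof's raw quality score, v the number of votes, and q-bar the mean quality, taken here as the rounded 3.4. The two worked lines are the single-vote 5 and the 50-vote 4.5 mentioned earlier.

```latex
q' = \bar{q} + (q - \bar{q})\sqrt{v/10}

% one vote of 5:
q' \approx 3.4 + (5 - 3.4)\sqrt{1/10} \approx 3.4 + 1.6 \times 0.316 \approx 3.91

% fifty votes averaging 4.5:
q' \approx 3.4 + (4.5 - 3.4)\sqrt{50/10} \approx 3.4 + 1.1 \times 2.236 \approx 5.86
```

So the 50-vote prof now ranks well above the single 5, and a prof with exactly 10 votes is left unchanged. The adjusted score is no longer confined to the 1 to 5 scale.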

Modified Quality Distribution

Much nicer. Except there’s still one prof, originally rated 2.3 by 198 voters, who gets slaughtered down to a -1.5. Maybe we don’t need to worry about a few edge cases.
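
The arithmetic for that one prof can be checked with an awk one-liner. This is only a sketch: it plugs in 3.4, the rounded average quoted above, since the exact mean isn't given.

```sh
# 3.4 is the rounded mean quoted in the post; the exact mean is not given.
adj=$(awk 'BEGIN { printf "%.2f", 3.4 + (2.3 - 3.4) * sqrt(198 / 10) }')
echo "$adj"
```

That comes out at -1.49: the prof sits 1.1 below the mean, and the 198 votes stretch that distance by a factor of about 4.45.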
