For service in gopher, dial 70.

July 10, 2008

Using AWK to convert CSV to XML

Filed under: awk, programming — Dave @ 10:52 am

I needed to convert CSVs to XML, so it’s time to return to my text processing hero: AWK.

CSV to XML gets me 90% of the way there, except, it doesn’t actually convert from Comma Separated Values to XML, it uses space to separate values. Updated script:

BEGIN {RS = "\n"
FS = "," }
NR == 1 {for (i = 1; i <=NF; i++)
tag[i]=$i
print "<" node "XML>“}
NR != 1 {print ” <" node ">”
for (i = 1; i <= NF; i++)
print " <" tag[i] ">” $i “
print ” “}
END {print ““}

Notes on cleaning your data:

  • our column headings need to be sane, ie: no spaces or weird characters

  • Numbers with commas, ie “12,231″ won’t work

April 7, 2007

Awksomeness (part 2)

Filed under: awk, programming, statistics, R — Dave @ 8:39 pm

I expanded the program from earlier to include special cases related to the data at Rate My Prof:

BEGIN {
 s=""; FS="n";
 print ("last,first,department,votes,quality,ease");
}
/<td/ {
 str = $1;
 gsub(/<[^>]*>/, "",  str);
 gsub(/[t ]/, "", str);
 if( length(str)<40 && length(str)>0 )s=(s str ",");
}
/<tr|<TR/ {
 sub(/,$/, "", s);
 gsub(/&nbsp;/, "0", s);
 gsub(/,,/, ",", s);
 if(length(s)>0) print s; s=""
}

In R:

> uw<-read.csv("c:/newsite/articles/ratemyprof/marksuw.txt")
> plot(uw$ease,uw$quality, xlim=c(1,5), ylim=c(1,5))

Quality vs Easiness

And the first result is that a professor’s quality and easiness aren’t strongly correlated.

Actually, here’s a more honest graph:

> uw$quality2<-uw$quality+runif(length(uw$quality), min=-.05, max = .05)
> uw$ease2<-uw$ease+runif(length(uw$ease), min=-.05, max = .05)
> plot(uw$ease2,uw$quality2, xlim=c(1,5), ylim=c(1,5))

Quality vs Easiness 2

Looking at the distribution of “quality” marks:

Original Quality Distribution
The data isn’t normally distributed — not even close (the average is 3.4), and if a prof has only one vote then that vote really skews them far more than it should (a prof with 50 votes averaging 4.5 is probably better than a prof with a single 5). So I’m going to multiply the distance from the mean by the root of the number of votes:

> qual<-mean(uw$quality)+((uw$quality-mean(uw$quality))*(uw$votes/10)^.5)
> hist(uw$qual,breaks=c(20))

Modified Quality Distribution

Much nicer. Except there’s still one prof originally rated 2.3 — but 198 times who gets slaughtered down to a -1.5. Maybe we don’t need to worry about a few edge conditions.

April 2, 2007

Awk is Awksome

Filed under: awk, data, programming — Dave @ 8:28 pm

I wanted to analyse some of the data from Rate my Professor, and there’s no easy “download as CSV button” so I was going to have to screen scrape.

Awk to the rescue

I use Gawk for Windows and Wget. The awk code to turn a HTML table into a comma seperated file is (1.awk):

BEGIN {s=""; FS="n"}
/<td/ { gsub(/<[^>]*>/, ""); s=(s ", " $1);}
/<tr|<TR/ { print s; s="" }

And then you execute it as: gawk -f 1.awk *.jsp > marks.csv

That’s it, well almost. As always border cases take up most of the code, I’ll post the longer version, and the R-code later.