GPT vs Claude in analyzing tweets (part 1)

I have a cute little script that hides garbage tweets from your feed using GPT, so that even a “For You” feed can be completely free of politics or toxicity.

[Screenshot: hiding tweets]

But it’s been struggling with some people’s filters and Twitter feeds, so I’m tweaking the prompts and at the same time want to see how GPT3.5, GPT4 and Claude3 would handle the task.

So I grabbed a few tweets, classified them by hand, and then … pretty much let GitHub Copilot write the script for me. Aside: I don’t know if this saved me any time at all — I think I spent longer debugging it than it would’ve taken to write it myself — but it’s amazing that it worked at all. Special thanks to Helicone for caching[1], since I was re-running a lot.
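The core of a script like this is a per-tweet classification call. Here’s a minimal sketch of what the prompt-building and answer-parsing steps could look like — the topic list, function names, and prompt wording are my assumptions, not the actual code:

```python
# Hypothetical sketch of the per-tweet classification step.
# The real script's prompt and parsing are not shown in the post.

TOPICS = ["sports", "politics", "american politics", "ads", "rudeness"]

def build_prompt(tweet: str, topics: list[str]) -> str:
    """Ask the model to tag a tweet with zero or more filter topics."""
    return (
        "Classify this tweet. Reply with a comma-separated list of any "
        f"matching topics from {topics}, or 'none'.\n\n"
        f"Tweet: {tweet}"
    )

def parse_topics(reply: str, topics: list[str]) -> set[str]:
    """Turn the model's free-text reply into a set of known topics."""
    found = {t.strip().lower() for t in reply.split(",")}
    return found & set(topics)
```

The same prompt can then be sent to each model (GPT3.5, GPT4, Claude) and the parsed sets compared against the hand labels.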

To be honest, the goal was more to play with LLM-written code in a greenfield project — but I do want to get some rough answers out of this, so I pasted everything into a Google Doc. Then asked GPT to format it in Markdown.

So across the (small) sample set, here’s the total error by LLM and topic:

[Table: total error by LLM across Sports, Politics, American Politics, Ads, Rudeness, and Average]
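The “total error” here can be read as the number of tweets each model misclassified per topic. The post doesn’t show the scoring code, but a minimal sketch under that reading might be:

```python
# Hypothetical scoring step: count disagreements per (model, topic).
from collections import defaultdict

def total_errors(predictions, labels):
    """predictions: {model: {tweet_id: {topic: bool}}}
    labels:      {tweet_id: {topic: bool}}  (hand-classified ground truth)
    Returns {(model, topic): error_count}.
    """
    errors = defaultdict(int)
    for model, per_tweet in predictions.items():
        for tweet_id, verdicts in per_tweet.items():
            for topic, predicted in verdicts.items():
                if predicted != labels[tweet_id][topic]:
                    errors[(model, topic)] += 1
    return dict(errors)
```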

Some preliminary results:

  • I honestly thought ads would be easier. Maybe I’ll just hardcode that.
  • GPT4 beats GPT3.5 and Claude’s Opus beats Haiku as you’d expect
  • Maybe the poor result for “Politics” is because Rishi Sunak has only been PM for 18 months, which is after the training cutoff for all the models

Next steps:

  • Currently I use GPT3.5 — the worst performer of the group — for AI Helper Bot. I might rethink that, but it didn’t get completely trounced, GPT4 would be slower, and Claude would mean re-writing code, so I’m going to hold off.
  • I’m going to recycle the code to compare some optimizations of the prompts (which will be a lot less interesting)
  • Yeah, lemme hardcode ads, give the AI one less thing to chew on.


  1. Copilot struggled with Helicone’s library — maybe that’s because it’s new, maybe I explained it badly. Either way, I had to do it manually, by which I mean copying it from the docs.