So how does the new GPT model shape up?

Or what could be wrong with 50% faster and cheaper?

May 14, 2024

Most people were wowed yesterday by the human interactions, and if you haven’t seen the YouTube video, I recommend it. However, I’m much more interested in how well it performs in the mundane market research applications I’ve been working on with the team at MX8 Labs.

Overall, the new model stands out for being 50% faster, with pricing that looks more cost-effective. It beats previous versions on most tasks, does particularly well on non-English work, and takes a more visual approach to problem-solving.

Based on their initial confidence in its performance, the MX8 team is considering transitioning some of its workloads to the new technology in the coming week. That’s lightning-fast adoption.

For those interested in the details, dive right in below…

Survey writing

GPT-4 was already quite good at writing surveys from scratch, and the new version produces significantly longer and more comprehensive surveys from a small brief. It collects more detailed demographics and favors more rating-style questions.

Interestingly, it also writes surveys more visually, without being prompted to.

The surveys still need to be edited before fielding, but they do a great job of getting things off the ground.

Survey programming at MX8 is done using Claude because it is better at producing longer output, so I don’t have any details about how much better the new GPT is for that use case.

Report writing

This is hard to judge, but the new model's reports appear to be clearer. Where the old one said:

LinkedIn: Predominantly used by the 25-44 age group, with 95% spending less than one hour daily. Another significant segment (80%) also spends less than one hour, indicating brief but frequent engagement.
Pinterest: Similar to LinkedIn, it sees significant use among the 25-44 age group (88%) and the 65+ demographic (88%), with users primarily spending less than one hour daily.
Instagram: Shows a broad age range of users, with 86% of those 65+ and 40% of an unspecified age group spending 1-2 hours daily.
Facebook: Notable for its 1-2 hours daily usage by 37% of respondents and being the primary platform for 95% of users aged 65+.
Discord, Snapchat, TikTok, YouTube: These platforms have a notable presence among users spending less than one hour daily, indicating a trend towards brief interactions.

The new one says:

LinkedIn:
- 95% of respondents aged 25-44 spend less than one hour per day on LinkedIn.
- 80% of respondents in general spend less than one hour per day on LinkedIn.
Pinterest:
- 88% of respondents aged 25-44 spend less than one hour per day on Pinterest.
- 88% of respondents aged 65+ spend less than one hour per day on Pinterest.
- 74% of respondents in general spend less than one hour per day on Pinterest.
Instagram:
- 86% of respondents aged 65+ spend less than one hour per day on Instagram.
- 40% of respondents spend 1-2 hours per day on Instagram.
- 40% of respondents spend less than one hour per day on Instagram.
Facebook:
- 37% of respondents spend 1-2 hours per day on Facebook.
- 37% of respondents spend less than one hour per day on Facebook.
Discord:
- 48% of respondents spend less than one hour per day on Discord.
Snapchat:
- 47% of respondents spend less than one hour per day on Snapchat.
TikTok:
- 39% of respondents spend less than one hour per day on TikTok.
YouTube:
- 36% of respondents spend less than one hour per day on YouTube.

Synthetic sample

The largest differences here appear in open-end responses. I don’t have enough information to know if it’s really more predictive, but the results seem pretty comparable with what we saw with GPT-4. The 50% speed increase is a big boost for this use case, though.

Open-end coding

The open-end coding results for English are very similar. Still, we’re seeing better results for translated responses, where there was previously a tendency to bucket too many items into an “other” category. This requires more research, but upgrading fairly quickly seems sensible.
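For anyone who wants to put a number on the “other” problem, here is a minimal sketch of one way to do it, assuming you already have the codes each model assigned to the same set of responses. The function name, the label "other", and the sample codes are all made up for illustration; this is not MX8's actual pipeline.

```python
from collections import Counter


def other_share(codes, other_label="other"):
    """Return the fraction of coded responses that fell into the catch-all bucket."""
    if not codes:
        return 0.0
    # Normalise case and whitespace so "Other " and "other" count as the same code.
    counts = Counter(code.strip().lower() for code in codes)
    return counts[other_label] / len(codes)


# Hypothetical codes assigned to the same six translated responses by each model.
old_model_codes = ["price", "other", "other", "quality", "other", "service"]
new_model_codes = ["price", "delivery", "other", "quality", "packaging", "service"]

print(f"old: {other_share(old_model_codes):.0%}")
print(f"new: {other_share(new_model_codes):.0%}")
```

A lower share is only a rough signal. It says fewer responses were dumped into the catch-all bucket, not that the new codes are right, so a manual spot check is still needed.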

Translation

We’re seeing significant differences, and early feedback is that it’s better. A question like “What year were you born in?” was previously translated into Hindi as आप किस वर्ष में पैदा हुए थे? and now comes out as आपका जन्म किस वर्ष हुआ था?

I’m assured that the latter is a better translation for the context of the question. This will take some time to iron out because of the need to manually verify which is working better.

Question matching

This is the one area where the new version seems less performant than GPT-3.5 out of the box. It matches significantly fewer survey questions to the correct items in sample provider databases, so this is one to stick with 3.5 for the time being.
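A comparison like this boils down to scoring each model's matches against a hand-labelled set. The sketch below shows the idea under stated assumptions: the question texts, the item names such as "birth_year", and both sets of matches are invented for illustration and are not real results or real sample provider fields.

```python
def match_accuracy(predicted, gold):
    """Share of questions whose predicted database item equals the hand-labelled one."""
    if not gold:
        return 0.0
    # A question the model left unmatched counts as a miss.
    correct = sum(1 for question, item in gold.items() if predicted.get(question) == item)
    return correct / len(gold)


# Hypothetical hand-labelled matches (question text -> database item).
gold = {
    "What year were you born in?": "birth_year",
    "What is your gender?": "gender",
    "What is your household income?": "hh_income",
    "Which state do you live in?": "region",
}

# Hypothetical output from two models for the same questions.
model_a_matches = {
    "What year were you born in?": "birth_year",
    "What is your gender?": "gender",
    "What is your household income?": "hh_income",
    "Which state do you live in?": "zip_code",
}
model_b_matches = {
    "What year were you born in?": "birth_year",
    "What is your gender?": "gender",
}

print(f"model A: {match_accuracy(model_a_matches, gold):.0%}")
print(f"model B: {match_accuracy(model_b_matches, gold):.0%}")
```

Counting unmatched questions as misses is a design choice. If a wrong match is worse than no match for your use case, track the two separately.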
