We put four top vibe coding tools to the test to find out which one is the best | Track Meet Part 2

Ilan (00:00)
David, maybe before we jump into the tools, do you want to tell us a little bit about how we evaluated each of these?

David Vuong (00:06)
That's right. That's right.

Ilan (00:09)
That's right, we did evaluate them.

David Vuong (00:12)
That's right. Good job.

Hey everybody, welcome back to Prompt and Circumstance. My name is David, and today it is part two of the track meet.

Ilan (00:19)
And I'm Ilan.

David Vuong (00:34)
All right. The long wait is over and it's time for part two of our roundup of four vibe coding tools: v0, Lovable, Replit, and Bolt. In the previous episode,

We saw the 100 meter dash, which was a test of speed that Bolt won.

We then saw the long jump, which was a test of depth and precision, which Replit won. And we will be deploying the winner of each of these events. Now, if you're interested in the prompts, leave us a comment and we will DM you with the link.

All right. Before we get into the pole vault, Ilan, what are your thoughts so far on all of these events?

Ilan (01:14)
I've been surprised at the spread of the scores across all of the tools. I was expecting them to be maybe a little bit more clustered, but in both of the events that we've seen so far, we've seen some really high scores and some really low scores. So that's something I'm keeping an eye on as we go into the last event. I think we should also remind folks that

If they haven't watched part one of the track meet, pause this episode right now, go into your feed, listen to the previous episode, or if you're on YouTube, watch the previous episode, because that will catch you up to where we are today.

David Vuong (01:55)
That's right. You know, one of the things that I also noticed was that

even the worst performer produced something that was quite good. They were all reasonably good, especially given how quickly they were able to produce it. Even when we did the 100 meter dash, the last-place contestant, Replit, still took only a few minutes to produce something that was quite good, and again, a human would never be able to match that pace.

Ilan (02:26)
That's true, David. I once heard a comedian talking about how, when wifi was first available on planes, it was like, my God, this is amazing. And then within six months, it was like, the wifi is not working on the plane. Like, you're in the sky and you have internet.

David Vuong (02:45)
Yeah

Ilan (02:45)
That's kind of the bar that we're setting for these tools. It's like, can you do this thing fast that six months ago I couldn't have even imagined a tool doing at all?

David Vuong (02:54)
Yeah, absolutely. It's phenomenal, the change that we are seeing here.

Ilan (03:00)
Well, with that, do you want to take us into the pole vault?

Ilan (03:03)
But first, let's hear from one of our sponsors.

Do you have a side hustle? Why not? If you're worried about validating your market or finding time to build your product, then Co.Lab's Validate and Build will help you out.

They'll go out and ensure that users really want your product and then they'll build an MVP so that you can have your first paying customers. They'll hand off the product to you and you can take it from there so you can run your side hustle.

Follow the link below to find out more, and let them know David and Ilan sent you for $250 off.

David Vuong (03:37)
Yeah, all right, let's get into it. So the pole vault is about testing how well these tools can handle complexity. We are going to give it an enormous prompt. It's going to involve not only logic that it needs to figure out, but also calling APIs, doing integrations with other systems.

Ilan (03:57)
In fact, it was so complicated, I think I had to read the prompt like four times before I understood what was going on.

David Vuong (04:04)
All right, well maybe that's a hint for me to explain what exactly we're doing with this prompt.

Ilan (04:10)
Yeah, let us know a little bit about that, David.

David Vuong (04:12)
All right. So one way that these models are being tested is in terms of benchmarks.

And one of the benchmarks that's really interesting to me is this thing called the Graduate-Level Google-Proof Q&A, or GPQA. And what this does is it tests the models on a variety of graduate-level questions from science, technology, engineering, and mathematics fields.

So these questions are posed as multiple choice questions, and boy, are they complex. And what the model needs to do is correctly identify the right answer.

And you can see on this page in terms of how these models are performing.

Not only do you see their performance overall in terms of percentage accuracy, like here we have Claude 3.7 Sonnet achieving a mean accuracy of 79%, but you also see the compute that it took to get there. So you can see here that Grok, although it did all right with 68%, took a great amount of compute in order to get there.

So what we will be doing with this application is we are going to be retrieving a question from the GPQA diamond dataset and then asking the LLM what the answer is going to be to see just how exactly it answers.

Ilan (05:36)
Gotcha, so what I'm hearing is that Grok likes to turn on the microwave with nothing in it just to get a little bit of a better response. But some of these, like Claude, are a little bit more efficient; maybe they just nuke the thing for 30 seconds and then eat it.

David Vuong (05:57)
That's right. Yes, a warm-on-the-outside, frozen-in-the-middle burrito.

All right, so what we'll be making today is an application that will allow a user to provide the API keys for these integrations so that the application can retrieve that question and then run that question against the LLM and then check to see whether the LLM had responded correctly.

Also included is the ability to save the history of responses, because we want to see how these LLMs do over a variety of questions.
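For anyone who wants to see the plumbing behind an app like this, here is a minimal TypeScript sketch of the two calls involved: pulling one GPQA Diamond row from the Hugging Face datasets-server and asking an OpenAI chat model to answer it. This is not any of the tools' generated code; the way the question record is handled and the environment-variable key names are assumptions for illustration, and in the real apps the keys come from the settings UI instead.

```typescript
// Minimal sketch: fetch one GPQA Diamond row from the Hugging Face datasets-server,
// then ask an OpenAI chat model to answer it. Not any tool's generated code.
const HF_TOKEN = process.env.HF_TOKEN ?? "";        // the dataset is gated, so a token is needed
const OPENAI_KEY = process.env.OPENAI_API_KEY ?? "";

async function fetchRandomGpqaRow(): Promise<Record<string, string>> {
  const offset = Math.floor(Math.random() * 198);   // GPQA Diamond has 198 questions
  const url =
    "https://datasets-server.huggingface.co/rows" +
    `?dataset=Idavidrein/gpqa&config=gpqa_diamond&split=train&offset=${offset}&length=1`;
  const res = await fetch(url, { headers: { Authorization: `Bearer ${HF_TOKEN}` } });
  if (!res.ok) throw new Error(`Hugging Face request failed: ${res.status}`);
  const data = await res.json();
  return data.rows[0].row;                          // a single question record
}

async function askModel(question: string, options: string[]): Promise<string> {
  const prompt =
    `${question}\n\n` +
    options.map((opt, i) => `${"ABCD"[i]}. ${opt}`).join("\n") +
    "\n\nReply with a single letter followed by a short rationale.";
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENAI_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;           // e.g. "B. Because ..."
}
```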

Ilan (06:31)
David, before we launch into the tools, do you want to tell us how we evaluated them?

David Vuong (06:36)
And so here are the 10 criteria that we will be evaluating these tools on.

Number one, it needs to save the API tokens and hide them. Number two, it needs to be able to fetch a random question from the GPQA. Number three, it needs to be able to execute that question against the selected LLM. Number four, it needs to disable the run button while it's waiting for a response.

Number five, it needs to show the response from the LLM correctly on the UI.

Number six, it needs to save the history of what the AI responded with. Number seven, it needs to be able to allow the user to remove the token after they've saved it.

Number eight, if there's a problem with the token, it needs to surface that to the user in a toast.

Number nine, it needs to be able to support the width that a mobile phone would reasonably have. So something like just a few hundred pixels wide, and it needs to be able to do that without any horizontal scroll bars.

Number 10, in addition to supporting the mobile width with no scrolling horizontally, it needs to ensure that there's no text overlapping in that mobile setup.

Ilan (07:53)
And David, how did we score these against those evaluation criteria? Because LLMs, as we know, are probabilistic, so they don't always come out with the same response each time.

David Vuong (08:06)
That's right. We gave them three attempts overall to accomplish 10 out of 10 on this rubric, just like with a pole vault. And within each attempt, we gave them four fix opportunities, where each additional fix carried a penalty. So if it one-shotted it, then no penalty. If it needed a fix, then that would be a half-point penalty per fix.

So if it needed four fixes in order to get to a working state, then that would be a penalty of two points.
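To make that arithmetic concrete, here is a tiny sketch of how a single attempt's score works under the rules just described. It is only an illustration, not the actual judging sheet.

```typescript
// One point per rubric criterion met, minus half a point per fix prompt used.
function scoreAttempt(criteriaMet: number, fixesUsed: number): number {
  return Math.max(0, criteriaMet - 0.5 * fixesUsed);
}

// Examples matching the results discussed in this episode:
scoreAttempt(10, 0); // clean one-shot: 10
scoreAttempt(10, 1); // all ten criteria, but one fix needed: 9.5
scoreAttempt(8, 4);  // eight criteria and four fixes: 8 - 2 = 6
```

A tool's result for the event is then the best of its three attempts.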

Ilan (08:39)
All right, so we've got a pretty high bar for these tools to clear. So let's get into it and see how they did.

All right, so let's start off in v0. This is the last version of the app that it created. And what we see here is we've got our main page where we can evaluate our runs. We also have the ability to update the prompt template.

and we have a history.

And we can go to settings and we can see that the credentials that we've put in are saved. We can also see that the tokens are hidden.

And the storage is also encrypted, so the tokens are not available in the storage. And actually, for v0, that ended up being an issue. I don't know if you can see it on your screens, but one fix that I had to ask

v0 to make was that it couldn't decrypt its own encrypted storage, because it was creating a new encryption key each time. So that was the one fix it had to do, and after that it ended up with a 10 out of 10, so it hit all of the criteria. So let's give it a try here and show what it does.
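For the technically curious, the failure mode Ilan is describing is a classic one with the browser's Web Crypto API: if the app generates a fresh AES key on every page load, it can encrypt new data but can never decrypt what it stored last time. Here is a rough sketch of the fix, assuming the generated app keeps everything in localStorage; v0's actual code may look different, and stashing the exported key next to the ciphertext is obfuscation rather than real security.

```typescript
// Sketch of reusing one AES-GCM key across page loads instead of generating a new
// key every time (the failure mode described above). Keeping the exported key in
// localStorage next to the ciphertext is only obfuscation, not real security.

async function getOrCreateKey(): Promise<CryptoKey> {
  const saved = localStorage.getItem("aes-key");
  if (saved) {
    const raw = Uint8Array.from(atob(saved), (c) => c.charCodeAt(0));
    return crypto.subtle.importKey("raw", raw, "AES-GCM", true, ["encrypt", "decrypt"]);
  }
  const key = await crypto.subtle.generateKey(
    { name: "AES-GCM", length: 256 },
    true,
    ["encrypt", "decrypt"]
  );
  const raw = new Uint8Array(await crypto.subtle.exportKey("raw", key));
  localStorage.setItem("aes-key", btoa(String.fromCharCode(...raw)));
  return key;
}

async function encryptToken(token: string): Promise<string> {
  const key = await getOrCreateKey();
  const iv = crypto.getRandomValues(new Uint8Array(12));     // fresh IV per encryption
  const ciphertext = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv },
    key,
    new TextEncoder().encode(token)
  );
  // Store the IV alongside the ciphertext so the token can be decrypted later.
  const payload = new Uint8Array([...iv, ...new Uint8Array(ciphertext)]);
  return btoa(String.fromCharCode(...payload));
}
```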

David Vuong (09:41)
Let's see it in action.

Ilan (10:03)
So here we got one question from the GPQA data set,

I'm sure you already have the answer in your head.

David Vuong (10:10)
When in doubt go with C.

Ilan (10:14)
That's right. But we're going to try and run this against an LLM. We have set this up with gpt-4o-mini, and so let's click there and see what happens.

So this is telling us that it gave the correct answer. It gave the response B, and it also had the rationale for that response.

David Vuong (10:39)
Excellent. You know, one of the things that our audience ought to know is that part of the response from Hugging Face for the GPQA question is also the answer, that is, what the correct answer is. And so the UI here has a button to let you show the correct answer.

Ilan (10:56)
So we can confirm that gpt-4o-mini actually did get this correct. I did check the call to the OpenAI API, and the call is not providing the answer.

And on the other side, it is reproducing the rationale word for word with what was sent back by OpenAI. So this is what we got.
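What Ilan verified there comes down to a simple separation in the app: the Hugging Face record contains the correct answer, so the app shuffles the choices, remembers which letter is right for the reveal button and the correctness check, and leaves that letter out of the prompt it sends to OpenAI. A hedged sketch of that step follows; the GPQA field names here are assumptions about the dataset schema.

```typescript
// Sketch: turn a GPQA record into shuffled multiple-choice options, keeping the
// correct letter out of the text sent to the LLM. Field names are assumed.
interface GpqaRecord {
  Question: string;
  "Correct Answer": string;
  "Incorrect Answer 1": string;
  "Incorrect Answer 2": string;
  "Incorrect Answer 3": string;
}

function buildChoices(row: GpqaRecord) {
  const options = [
    row["Correct Answer"],
    row["Incorrect Answer 1"],
    row["Incorrect Answer 2"],
    row["Incorrect Answer 3"],
  ].sort(() => Math.random() - 0.5);                 // quick-and-dirty shuffle for illustration
  const correctLetter = "ABCD"[options.indexOf(row["Correct Answer"])];
  // Only `question` and `options` go into the LLM prompt; `correctLetter` stays
  // client-side to power the "reveal answer" button and the correctness check.
  return { question: row.Question, options, correctLetter };
}
```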

David Vuong (11:19)
Outstanding. And it stores it in history too, does it?

Ilan (11:23)
That's right, so we can then go to the history and we can see that we had two questions here. This is the one that we just opened up and we can see the question, the options and the response.

David Vuong (11:37)
And a beautiful UI too. You know, it's minimalistic, but it certainly works.

Ilan (11:42)
Mm-hmm.

David Vuong (11:44)
So how did v0 do in the end?

Ilan (11:47)
v0's top score across three attempts was a 9.5 and that was on this attempt where it was docked half a point because it needed one fix.

David Vuong (11:59)
The encryption fix where it locked itself out of its own home.

Ilan (12:03)
That's right.

David Vuong (12:05)
All right, so let's move on to Lovable.

All right, so here is what Lovable ended up creating. It created a very simple page where we do have the settings here for API tokens. So let's open that up. We've got a nice little dialogue. Let's go ahead and put in our tokens.

And now with our tokens put in, we can see that the model it has selected is gpt-4o-mini. That's cool. The URL is correct. So now we can go and retrieve a question from Hugging Face. Excellent.

So these toasts down here are quite nice as well.

And here we have the GPQA Diamond question, talking about an aperture shaped like an n-sided polygon. Let's run this against gpt-4o-mini and see what it says.

It looks like 4o-mini shows the correct result, C. And we can see the full response here from gpt-4o-mini, which is quite nice. And so now we can see that it is going to add it to the history. And you can see that I've had some previous runs where 4o-mini did not have a correct result.

And it's a nice little dialogue we've got here as well, very nicely laid out.

Clicking onto one of the previous runs, we see that it is going to draw the results of the question and answer outside of the modal window, which is not correct, but this is not something that we are evaluating. We are only looking at whether or not it keeps that history. So that's all right.

Okay, so this was Lovable's second attempt, and it actually got full marks. So it was able to save and hide the token. It was able to grab a random question. It executed that question, and the buttons were disabled while it ran. It was able to pull out the correct answer and got the history right. We can come over here and remove the token.

And so if I come over here and remove this API token, like so, I cannot run the Hugging Face query. And now if we shrink this to mobile width, it's going to work. So full marks on the second attempt with Lovable.

Ilan (14:41)
That's pretty impressive. I mean, this was a really detailed PRD that we uploaded, as we talked about before. There was a lot to get done here. It's kind of amazing that it one-shotted it and got everything right.

David Vuong (14:54)
Yeah, absolutely. And in fact, since it aced the pole vault on the second attempt,

I gave it the opportunity to flex a little bit with its third attempt. On the third attempt, I gave it the same PRD; however, I also added the additional complexity of telling it to style the page according to our logo.

And here we go. This also aced the pole vault, and again it was one shot. So here we have the credentials area laid out a little bit differently, but everything here actually works, so it's quite nice.

Ilan (15:30)
This is really cool, David, and nice flex, Lovable.

David Vuong (15:35)
Yeah, good job.

Ilan (15:37)
How about Replit? How did that do?

David Vuong (15:39)
Yeah, so here we are with Replit and what it created. And you can see here, this is the third attempt. I told each of them to flag which attempt it is. So on the third attempt, it got nine out of 10. So let's walk through what it created. Now this layout is quite nice. It has a bit more color than what Lovable's second attempt had resulted in.

Let's come over here to the API keys.

And in this modal window, it's a single window where the user can enter all of their keys. So let's go ahead and get started.

All right, and now with the two keys that we're going to use entered into here, you might notice that there's some nice additional UI in here that was not called for, but a pleasant surprise. Like for example,

Just below the field for the Hugging Face API token, there's a little link that says, your token here. And what a delight that Replit has figured out that there is a link to get your API token.

And you might notice that there's a little security notice up top in the modal where it gives the user confidence that the API keys are going to be stored locally and encrypted. So that's a nice little bonus as well. Let's go ahead and save these credentials.

And now back here, let's get a random question.

All right, Ilan, how's your chemistry doing? How are your chemistry skills?

Ilan (17:12)
Yeah, you know, I'm pretty sure I could get this. Let's see how the tool does, but I have the answer in my head right now.

David Vuong (17:19)
You have it in your head,

right? Yeah. So here's a great question about 3,3,6-trimethylhepta-1,5-dien-4-one. And we're going to go ahead and ask OpenAI. And we're going to choose,

interestingly, gpt-4o as opposed to gpt-4o-mini.

And since it is 4o, it took a little while longer, but nevertheless, here we go. Not only did it take longer than mini, it also got the question wrong. All right, Ilan, I'm sure that you did not guess C, and I have the utmost confidence that you would have guessed A. Looks like A was the correct answer. Yes, that's right.

Ilan (18:01)
A!

David Vuong (18:08)
Now over here we have the recent history that the user can use to view previous responses. Now, there is a bug in here where, when the user clicks on this, it doesn't show the corresponding question; it shows the answer, but the question shown is still the old question. So if we had been evaluating on whether it shows the original question, a point would not have been attained here. However, the response

is stored in the history, so it does get points for this ability. Now, if we come over here and remove the Hugging Face token and save those credentials, then if we attempt to get a random question, it's going to not only give me a toast saying that it's missing the token but, another UX delight here, automatically open the credentials management

modal so that the user can right away enter in their token. Let's close this and we are going to now size this down to mobile sizing. So as we shrink this screen, it does quite well. Now,

What happens, though, is that when we scroll down, you can see that because of the choices that Replit made, it ends up with elements that leave the card. So this Run on LLM button unfortunately is drawn outside of the card, and that is where it lost the point. Now, no text is overlapping, so it got full points there. Well, the full one point for that criterion.

So, Replit got a 9 out of 10.

I think there are some good things though about what Replit has created. I do appreciate the use of color throughout. I appreciate some of the UX that it has added by itself. That was certainly delightful.

Ilan (20:15)
I do like how it did this. It looks pretty nice compared to some of the other tools that had a pretty muted overall design.

So here we see Bolt. I didn't pull it up in its unconfigured state, but this was the best attempt that Bolt put together. So we see here that we have a little settings modal up at the top right where we can add in

our tokens. We can also edit them. They are blocked out in the UI, but interestingly, Bolt was not able to encrypt the tokens in local storage like we asked, across all three attempts. So it lost points there each time.

David Vuong (21:02)
What did it do? Did it store it as plain text? My goodness. Okay.

Ilan (21:05)
Plain text. Mm-hmm. Yep.

So we have our settings enabled and we're able to pull a random question from the Hugging Face database.

In our settings, we can see that it's chosen the model for us, gpt-4o. So we're going to save that and we're going to try and run this question against the LLM.

David Vuong (21:37)
And this is another chemistry question it looks like.

Ilan (21:40)
It is testing your chemistry knowledge here, David.

David Vuong (21:44)
Well, I know that it's not B.

Ilan (21:46)
I'm glad that you eliminated that one. So we can see here that the latest response is B, and now we can see the history of responses, and we can reveal the correct answer as well.

One element of weird UX is that you can only reveal the answer in the history. You actually can't do it in the response itself, nor can you do it when you pull the question.

David Vuong (22:19)
I think we largely left that up to the LLM to decide. And that's an interesting difference in design. Speaking of design, I do like the choice of color here. I mean, in addition to the buttons being green and things that are wrong being red, the choice of a magenta-violet gradient and some of the purple that it's got there is quite nice.

Ilan (22:23)
That's right.

If you're watching this, you may recall that in part one of the track meet,

the v0 snake design used the exact same magenta gradient on its page. So it's definitely an artifact of the underlying models, which really like this purple gradient.

David Vuong (23:08)
I think I'll have to keep a lookout for this purple gradient on websites that I visit; then I'll know who generated it.

Ilan (23:14)
That's right, it'll be one of those things, like the em dash in ChatGPT responses.

David Vuong (23:21)
That's right

Ilan (23:22)
All right, so when we go into mobile view, we can see that Bolt did structure the page pretty well. Everything is compact. It used flex blocks so that the elements position themselves well on the page. However, when we go into settings, we can see that there is an overlap here, and so we do have to horizontally scroll to be able to access all of the data. And that lost it a point as well.

So overall this was an 8 out of 10, and the five tries lost it 2 points for the fixes, so we got a total of 6 for Bolt.

Overall, Bolt was never able to get a one-shot response. It took five prompts each time for Bolt to get to a working state.

Ilan (24:18)
While the judges tally their results, let's hear from one of our sponsors.

Are you having trouble wrangling too many data sources to get answers to your product questions?

Querio's AI agent sits on top of your data stack and connects the dots so that you can get product insights at your fingertips.

I use it personally and it saves me hours per week. You can try it too. Go to querio.ai and let them know that David and Ilan sent you and they'll give you two months free of Querio.

David Vuong (24:49)
We now have the final scores for all four of these vibe coding tools. So in first, we've got Lovable with a full 10 out of 10. Next we have v0 with 9.5 out of 10, and Replit in third with 9 out of 10. So the top three were a very close race, and Bolt rounds it out with 6.

Ilan (25:11)
So the thing that I find interesting here is that all three of Replit, Lovable, and v0 were able to get a working product from our PRD on the first or second attempt, whereas Bolt took multiple prompts each time, and that's really where it struggled compared to the others.

David Vuong (25:29)
Yeah, I really wonder whether that comes down to the system prompt or if it's due to other factors. Maybe Bolt just had an unlucky day. That happens to pro athletes too.

Ilan (25:39)
One of the interesting things about that, though, is that we already noted it's clearly using the same underlying model, given some of the styling cues that got repeated between some of the tools. So as you said, even though it's using the same models, just the system prompt or a bad day can really affect how the tool does.

David Vuong (26:01)
All right. So after these three events, let's see how our four contenders stand.

All right, so we have a three-way tie for most golds between v0, Lovable, and Replit. And to break the tie, we see that v0 is the only one of those three that has a silver. All three of them also have a bronze. So v0, congratulations on winning the very first track meet for vibe coding tools.

So it turns out that because Bolt did not complete the scope of what it was supposed to do in the 100 meter dash, the judges reconsidered the results. Bolt got a one minute penalty, which brought it from gold down to silver.

That's what gave v0 the extra gold and Bolt the silver, and the rest of the standings stayed the same.

Ilan (26:55)
Makes total sense. I like how those judges were thinking.

David Vuong (26:58)
Me too.

All right, now you might notice that we have a tie for second between Lovable and Replit. So, Ilan, what should we do?

Ilan (27:07)
I think we need to have a runoff between them and add a bonus fourth event. What do you think, David?

David Vuong (27:14)
All right, let's do it. A bonus head-to-head.

Ilan (27:17)
So David, for the 200 meter dash, we are going to be pitting these tools head-to-head on building the site to host the results of the track meet. And the grading here is gonna be really, really precise and scientific. How are we gonna grade them?

David Vuong (27:38)
Absolutely.

We're going to grade vibe tools on vibes. Let's do it. So what we're going to do is attach the new PRD that we've got, and we're going to tell it to go ahead and make those applications.

Ilan (27:42)
I love it. Alright.

David Vuong (27:57)
All right. And after a few bug fixes, Lovable finally has something ready to show. And it looks like we have two very similar looking pages. However, on the Replit side, we have text leaving the box. That's not so good.

On the Lovable side, we have something that looks quite nice. Some horizontal scroll bars. I don't know about that. What do you think, Ilan?

Ilan (28:27)
Well, David, I'll say one thing, which is that Lovable did take longer and had a bunch of errors to fix, but when push came to shove, it did deliver. It delivered only a little bit later than Replit did, and the overall UI/UX is a little bit cleaner to me. So for me, it's Lovable.

How about you?

David Vuong (28:55)
I feel the same. You know, even though it had some stumbling blocks, those things were easily overcome. And I like the overall layout of it, where things are tabular like this (and by tabular, I mean using tabs as opposed to a table). So my vote's with Lovable as well.

So congratulations to Lovable for winning second in the neck-and-neck tiebreak in our inaugural track meet.

Ilan (29:24)
What an exciting event. Congrats to v0 on your big win. And that'll wrap up our inaugural track meet.

David Vuong (29:33)
All right, so where do we go from here? You tell us. Are there any tools that we ought to be reviewing? Are there any other track meets that we ought to be running? Let us know.

Ilan (29:44)
Leave us a comment, and otherwise you can find us on all the socials @pandcpodcast. You can like and subscribe and leave us a five-star review. Thank you, everyone, for listening. See you next time.

David Vuong (29:57)
See you next time.

© 2025 Prompt and Circumstance