Total Freedom! How to Generate Audio Locally

David (00:00)
Hey everybody. Welcome to Prompt and Circumstance. My name is David. And today we're going to talk about generating audio locally on your machine.

Ilan (00:04)
and I'm Ilan.

On this week's episode, David's been playing around more with local models. He's gonna take you through how he generated music and speech, all on his local machine.

David (00:33)
And by the way, nice shirt, Ilan.

Ilan (00:35)
Thank you! It's available now at DevilWearsProduct.shop. All kinds of product management and developer-themed gear.

So David, people seem to really enjoy the episode about local image generation where we said you don't need nano banana anymore, but you've extended way beyond just image generation, right? So can you tell us a little bit about what you've been doing?

David (00:56)
Yeah, that's right. In addition to no longer needing Nano Banana, I no longer need ElevenLabs for generating voices and I don't need Suno anymore for generating music. Want to see how I do it?

Ilan (01:08)
Yeah, please.

David (01:09)
Okay. So let's first talk about making music.

So the model that we're going to be using is called ACE-Step 1.5. And by the way, all of this is going to be using ComfyUI. That's the UI that I'll be using; it's great for running models. If you haven't seen our previous episode: all these open source models, you can run them on your own machine, just a regular personal machine, and ComfyUI is a great way to do that.

Ilan (01:34)
David, quick question. What kind of machine specs do you need to be able to run these models?

David (01:40)
That's a good question, because not all consumer grade hardware is made the same. So with this model, ACE-Step 1.5, you can see here on their README that all you need is just four gigabytes of VRAM.

So it's very friendly in terms of being able to run things on light hardware.

Ilan (01:59)
Yeah, I mean, I think even my MacBook Air has at least 4 gigs of shared VRAM.

David (02:05)
There you go. So, right, this is made by ACE Studio, and we are going to link to the repo, and I'm going to jump right over to ComfyUI to see how all this works.

Ilan (02:17)
Awesome, let's see it.

Ilan (02:19)
This week's episode is brought to you by Querio. Have you ever found that your data team is bogged down in ad hoc questions from users, or that your product team can't quite get the answers that they're looking for out of the data? Querio's AI agent sits on top of your data stack and allows you to ask natural language questions to get the answers that you need immediately.

Try it today by going to querio.ai. Let them know that David and Ilan sent you for two months free of their Explore product. That's a thousand dollar value.

David (02:47)
All right, so here I am in ComfyUI with the workflow specific to ACE-Step 1.5. And again, for those who haven't seen ComfyUI before, I'm sure this looks intimidating. Have a look at our previous episode; you'll see that it's really straightforward, and you'll see also here that it's just changing a few parameters. All right.

Ilan (03:05)
And you basically just download this and load it into ComfyUI, correct?

David (03:11)
That's right. So what's great about this workflow is that it shows you exactly where to put that one file. So here's the file. You put it into this part of the ComfyUI folder and you're good. You just select it here when you're loading the checkpoints. That's it.

Ilan (03:30)
Okay, so there's nothing else to be done. You didn't have to build this whole workflow yourself.

David (03:35)
Nope. What's great about ComfyUI is that there's a very good community where people come together and create their own workflows, and the folks behind these open source models will also just provide the workflows that they put together specifically for their models.

Ilan (03:52)
All right, perfect. So it looks complicated, looks intimidating, but it's actually pretty simple. Somebody else has already done this work for you.

David (04:00)
That's right, just a few knobs and switches for us to turn. Okay, so that was step one, as you can see here. So step two is going to be: well, how long is our song going to be? Let's leave it at this. It's 208 seconds, which works out to be like two to three minutes or so. Not two minutes, more like three minutes.

Ilan (04:03)
That's right.

Seems long for a pop song, David.

David (04:24)
Yeah, right. Yeah, we can make it shorter if we want. So hey, why don't we make this two minutes? How about that?

Ilan (04:29)
Perfect.

That's about my attention span these days.

David (04:32)
Your context window is shrinking.

Ilan (04:34)
Yeah, well, while OpenAI's context window grows, my context window shrinks.

David (04:40)
Yeah, that's right. That is how they win. Over here in step three, we have the prompt. Now, this prompt is a little bit different from the prompts that people might be used to when working with LLMs. The first part is the one that you might expect: some description of the song. As we discussed last time, it doesn't have to be so lengthy. It can be very brief. We can test that out.

Ilan (04:42)
Ha

David (05:07)
The second part of this prompt is the lyrics, so you can actually have complete control of the lyrics in the song, which I think is quite good. In addition to marking some part of the lyrics as the verse, you also have labels for the chorus. You put those into square brackets, and what happens is that the model will actually generate music such that the chorus will always sound the same if you repeat the chorus. And I think that's a really cool little feature of this model.
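
For anyone following along at home, here's a rough sketch of that two-part prompt structure. The description text, section labels, and lyrics below are illustrative stand-ins, not ACE Studio's actual examples:

```python
# Illustrative two-part ACE-Step-style prompt: a short style description,
# plus lyrics with section labels in square brackets. (Hypothetical text;
# the exact fields live in the ComfyUI workflow's prompt nodes.)

description = (
    "Bright synth-pop anthem, female vocals, acoustic guitar layered "
    "over shimmering synths, upbeat and radio-friendly."
)

lyrics = """\
[verse]
Backlog's growing but the sprint is almost done
[chorus]
We ship it anyway, under neon lights
[verse]
Stakeholders calling but we're standing by the build
[chorus]
We ship it anyway, under neon lights
"""

# Repeating the same [chorus] label is what tells the model those
# sections should sound the same each time they come around.
print(lyrics.count("[chorus]"))  # → 2
```

The repeated label is the mechanism David describes: identical `[chorus]` sections come back with the same melody.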

Ilan (05:37)
Mmm.

Amazing.

David (05:42)
So what I've got here is basically a dollar store ripoff of Taylor Swift. So I'm talking about a little synth pop anthem, and it's talking about it being sort of bright and using acoustic guitars. Just a great little pop song. All right. Okay, so down here.

Just a few more parameters for us to tinker with. We have the beats per minute, the time signature, the language, the key scale. So just all the things that you would probably care about if you're trying to make a song. You could leave these where they are, but it would probably make sense for you to have a bit of an opinion on this. Just like if you're vibe coding an app, right? If you just give it one sentence, okay, well, it'll do what it can, but you probably have a bit more particular of a taste than that.

Ilan (06:40)
This is something where your mileage may vary. If you're somebody who's got an opinionated taste in music, you can tweak those and make them exactly specific to what you have in mind. Or if you're like me and you barely know what a key scale means, then you could probably just leave it at the default or play around with it afterward.

David (07:02)
Absolutely. And look, I like music, but I don't know it well enough; I'm not a music theory major. And so I did what every other AI-native person does, which is defer to an AI model. So everything here that you see is actually generated. I told Gemini to make this prompt based off of examples that were provided by ACE Studio. And same thing with the lyrics. I just said, imagine that you're a pop songwriter and write me some lyrics that are suitable for this song.

Ilan (07:36)
Love it.

David (07:36)
Right, so I'm going to actually change the duration of the song back to where it was, because with the beats per minute multiplied across all the lyrics here, we want it to sound a bit natural and not hurried. But we can just listen to the first little bit. Everything else here, I don't need to touch; it's for the data scientists in the room. So I can just hit run.

and this is going to go ahead and generate this song for us.

Ilan (08:04)
Now David, is this workflow, with your prompt and all of the settings, something that we can export so that people can just download it and load it into ComfyUI if they want to follow along?

David (08:14)
Absolutely. It's just a JSON file.
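
Since sharing a workflow really is just passing a JSON file around, inspecting one is straightforward. A minimal sketch, assuming the exported file maps node ids to entries with a `class_type` and `inputs`; the node names and checkpoint filename here are illustrative:

```python
import json

# Tiny stand-in for an exported ComfyUI workflow (node ids -> node dicts).
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "ace_step_1.5.safetensors"}},
    "2": {"class_type": "TextEncodeAceStepAudio",
          "inputs": {"tags": "synth-pop anthem", "lyrics": "[chorus] ..."}},
}

# "Exporting" and "loading" a workflow is just a JSON round trip.
shared = json.dumps(workflow, indent=2)
loaded = json.loads(shared)

print(sorted(node["class_type"] for node in loaded.values()))
# → ['CheckpointLoaderSimple', 'TextEncodeAceStepAudio']
```

Nothing model-specific happens in the file transfer itself, which is why community workflows can be dropped straight into anyone's ComfyUI.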

All right, so now we've got a three-minute, 28-second-long pop song, and that was completed in 262 seconds. So, you know, that's actually kind of close to the length of the actual song. Let's have a listen.

Ilan (08:31)
Ha

David (09:01)
Not bad. By the way, I don't know if you noticed that the lyrics are about product management and software. Yeah. Let's fast forward to the chorus and see how that goes.

Ilan (09:06)
I did notice that.

David (10:01)
Not bad. Not bad for, you know, three minutes or so on just a consumer-grade GPU.

Ilan (10:02)
Wow.

That is really incredible. I'm actually floored at how good that was.

David (10:13)
So there you go, there's Dollar Tree Taylor Swift. And hey, maybe we're gonna launch an album.

Ilan (10:17)

Yeah, coming soon to Spotify.

David (10:22)
Yeah. Okay, let's do something a little bit different. So this has lyrics. So why don't we make a song that doesn't have lyrics? Maybe just something just like an EDM kind of a song.

All right, so let's be AI native and generate the description for our new song. So I'm gonna go here into Gemini. I'll put it into thinking mode and I'm gonna say, hey, give me the description for a Bollywood EDM cross song that becomes a club anthem and be sure to include the key beats per minute and time signature. I also gave it an example description.

So here we go.

Okay, and we're done. I'm just gonna go ahead and copy this. Well, it fired all of its billions of nodes there, whatever. Okay. So, all right, let's copy this over.

Ilan (10:59)
It did not think very hard there.

David (11:11)
Okay, so we've come back here into ComfyUI. I've copied and pasted the description into the prompt. I've removed the lyrics. I've set the duration to 120 seconds just to be quick. And here I've set the beats per minute to 128, the key scale to B minor, and the time signature is still 4/4. All right, let's do this.

All right, so that took all of 17 seconds. Right, 17 seconds to make this possible club anthem. Let's hear how it goes.

Ilan (11:37)
Wow.

David (12:31)
Not bad. Not bad.

Ilan (12:32)
Yeah,

not sure it's gonna be the next song playing at all the clubs in Berlin, but you know, not bad for 17 seconds.

David (12:40)
Pretty good for 17 seconds, better than any human can do.

David (12:43)
Okay. So that was generating music. Let's talk about generating speech. So there's this other model from Qwen, the Alibaba lab: Qwen3-TTS, a text-to-speech model. Again, it's open weights; you can download it and pop it into ComfyUI.

And here it is. So this is the workflow for Qwen3-TTS, and we will include a link to this in the show notes. You'll notice that there are three different flows that we can work with. Let's first talk about this one here, the voice cloning ability, because I think this is pretty powerful.

So what you do here is you upload an existing audio clip of somebody talking.

And what you can do is give it a transcript, right? Hey, here's what this person is saying. And then over here you tell the model to make this person say something else instead. Right? So here we have this clip of JFK.

Right? So this is something that we have transcribed and put into this section over here. Everything else here doesn't need to change. So let's go ahead and start generating. The requirements for this are also not that high; they're not very explicit about the precise requirements, but I can say that my machine has 16 gigabytes of VRAM and it works just fine. This model is not very big, and one way you can assess how much demand a model is going to put on VRAM is just by the raw size of the model file.
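
The back-of-the-envelope math behind that file-size heuristic: weight size is roughly parameter count times bytes per parameter (two bytes each for fp16/bf16 weights), and that's a floor on VRAM, since activations and caches come on top. A sketch, using a hypothetical 1.7B-parameter model:

```python
# Rough floor on the VRAM needed just to hold a model's weights:
# parameter count × bytes per parameter. fp16/bf16 weights take
# 2 bytes each; real usage is higher once activations are added.

def weight_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30

# Hypothetical 1.7B-parameter TTS model stored in fp16:
print(round(weight_gib(1.7), 1))  # → 3.2

# The same model quantized to 8-bit halves the footprint:
print(round(weight_gib(1.7, bytes_per_param=1), 1))  # → 1.6
```

This is why the raw checkpoint file size on disk is a decent proxy for the VRAM a model will demand.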

Ilan (14:25)
And that would be in terms of how many millions or billions of parameters it has.

David (14:28)
Yes, that's one of the properties of it. But if you just look at the raw file size of the model, then you'll figure that out. All right, so here what we've done is we've put in that JFK clip and told it to say this funny little sentence about what it's like to be a product person. Let's hear how that goes.

Ilan (14:34)
Okay.

You

David (14:57)
Pretty good, right? So you've got a couple things there. One is it sounds like he's still in the same place, right? So there's the reverb. And additionally, you notice that it also has his subtle accent.

Ilan (15:03)
Mm-hmm.

It does, yeah. The one thing I'm wondering after hearing that is how possible is it to change emphasis on words or parts of the sentence?

David (15:18)
That's a great question. I've tried to modify exactly how the speaker puts emphasis on certain words, and even the emotion,

and it doesn't do that very well. It tends to be kind of monotonous, and it'll just choose where it wants to put emphasis. That is a limitation of this model versus some of the commercial models. So if you work with, let's say, Gemini, there's AI Studio, where you can generate voices. You can actually give commands for emotion in your prompt. So you can say sad or happy or laughing, and the

Ilan (15:28)
Okay.

David (15:52)
voice it generates will actually do that. Whereas here, what it'll do is just read that text out, which is still awesome. Yeah.

Okay, so that's voice cloning. By the way, you actually don't need to have the transcript here; having the transcript just makes the quality better, right? But what I'm going to do here is remove the reference audio text, as in the transcript, and change this property here, x_vector_only. Don't worry about what that means; what it controls is whether or not you have a transcript there.

So I changed it from false to true. And let's have a listen.

Ilan (16:30)
That's how you can tell this was built by engineers, because a product manager would have that switch flip by itself if there's text in there.

David (16:37)
That's a really good point, yes.

All right, so this was generated so fast, by the way, but let's have a listen.

So I don't know if you noticed a little bit of a delta there. I noticed that it picked up a little bit of the reverb, which is nice, but the accent kind of got lost.

Ilan (17:02)
Yeah, I did notice that. That's where having the text, or the transcript of the original audio, really helps.

David (17:08)
Yeah, exactly. And again, if you're AI native, just throw that into an LLM and tell it to transcribe it for you. What are you doing? All right. So one other thing to know about this model, just like with the other voice synthesis models, is that you can generate your own voice. So over here, there are these other nodes for designing your own voice. Here's one where you can use one of the preset voices.

Ilan (17:13)
Yeah.

David (17:31)
So you can choose Eric to say this line here. Sounds like he's kind of angry. So let's have a listen and see how that goes.

David (17:54)
So I don't know why that voice has a Chinese accent. Maybe it has to do with the training data, but that was amusing to me. So that's using one of the preset voices. I'm not going to go through it, but over here, you can actually design your own voice, right? So you can actually say, hey, look, here's what the voice is like. And you can also choose the language that they're speaking. So, you know.

Ilan (18:08)
Okay.

David (18:15)
Lots of options here for generating in other languages.

Ilan (18:18)
I'm not gonna lie, David, after seeing all of this, I am even more worried about deepfakes than I ever have been.

David (18:27)
It's going to be a brand new world out there.

Ilan (18:30)
Well David, that was super cool. The fact that you were able to generate those clips, both music and speech, so quickly, all on your local machine... it's really incredible where these models have gotten to, and what the capabilities are now, where you don't need cloud subscriptions for multiple services anymore. You can just do this on your machine with your regular hardware. So thank you so much for showing us that.

David (18:58)
Yeah, it was fun walking you through that, and I hope our audience got some good learnings out of it.

Ilan (19:04)
Absolutely, I'm sure they did. Well, with that, thank you so much for listening. If you liked this episode, please subscribe, give us a like, and share it with somebody who you think might enjoy it. Hopefully it helps them out, too.

David (19:17)
And if you like Ilan's shirt, it's available at our store.

Ilan (19:20)
That's right David, at DevilWearsProduct.shop you can get all kinds of product management and developer-themed gear. So check it out.

David (19:29)
All right, awesome. So that wraps it for this week. We'll see you at the next one.

Ilan (19:33)
See you then.

© 2025 Prompt and Circumstance