Hey all, brand new to this community, excited to be here!

I’ve stumbled my way through SD, and I currently have text-generation-webui up and running, and now SillyTavern too. Having lots of fun with all of this stuff and learning how it all works together!

I’ve made a few models elsewhere, but for some reason I’m having trouble wrapping my head around TTS models. I have a voice I want to make a model for, and I currently have some videos of the person speaking. I’m very familiar with editing audio and video, but stripping out their voice second by second sounds exhausting tbh.

I was wondering if anyone has a good guide to their process for making a TTS model? Are there steps that can be automated while still producing decent results? How much audio of the person speaking do I need? Should I run any specific tools to clean up the audio? I’m completely new, so any and all advice would be great.

I want to run it locally and “plug it in” to my existing cluster, so I’ll also need the model to work with a tool that’s compatible with the programs above (and I’ll take advice there too if you have it!).

Thanks!