F5TTS, LivePortrait, X-Pose AI blitz

Got encouraged enough from speech demos to try bringing up some more models.

 

Notes for installing f5-tts:

 https://github.com/SWivid/F5-TTS

The python environment was created based on the README:

python3 -m venv f5tts
source f5tts/bin/activate

pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124

pip install f5-tts

That took 7GB.

Helas, it requires CUDA 12.4.  CUDA 12.4 now requires apt to install it, & apt has always been broken on the lion kingdom's 7 year old ubunt.   It has no hope of running on the old C library + kernal anyway.  The lion kingdom needs a way to dual boot without disrupting the day job.  CUDA won't drop the GTX 1050 until version 14.

 The laptop's Ubunt 18 is no longer supported & its apt installation has slowly broken over time, but apt worked just enough to install the C library dependencies.  It seems if pytorch doesn't find a matching CUDA version, it silently reverts to software & hangs.

 The trouble spot is the CUDA upgrade for every new AI tool, manely installing the matching pair of CUDA toolkit & drivers.  For CUDA 12.4, it seems to require using apt to install cuda-drivers-550, which is broken.

cuda-repo-ubuntu2004-12-4-local_12.4.0-550.54.14-1_amd64.deb

NVIDIA-Linux-x86_64-550.144.03.run
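A minimal sketch of how those 2 pieces go on, assuming the stock NVIDIA installers (the exact toolkit steps depend on how broken apt is):

# matching driver from the .run file, installed outside X
sudo sh NVIDIA-Linux-x86_64-550.144.03.run

# local CUDA 12.4 repo, then the toolkit through apt (the part that keeps breaking)
sudo dpkg -i cuda-repo-ubuntu2004-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update && sudo apt-get install cuda-toolkit-12-4

# sanity check that the driver is alive
nvidia-smi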

In this case, there was a downloadable driver with a version that matched a number in the CUDA filename.  That got the test script below to detect the GPU.

source f5tts/bin/activate

python3

import torch
# False here means the driver & the CUDA build don't match
print(torch.cuda.is_available())
print(f"GPU device: {torch.cuda.get_device_name(0)}")

When f5-tts_infer-cli is running, nvidia-smi shows it using 1.5G of VRAM.  The 1 demo takes 71s to render on the GTX 970M.  It's a hair faster than the GTX 1050, based on the internet.

f5-tts_infer-cli --model F5TTS_v1_Base \
--ref_audio "/root/f5tts/relaxing1.wav" \
--ref_text "Once upon a time, in the middle of winter, when the flakes of snow were falling like feathers from the sky, a queen sat at a window sewing." \
--gen_text "Now for another forgotten piece of gen X sauce"


42 seconds at 120W later, the resemblance to the reference voice is lightyears ahead of festival but the vocal tones & phrasing are just as bad as festival.  They continue to have the odd inflections of 30 years ago.  There's just not as much value in replicating a reference voice as there is in making intelligible inflections & phrasing.  

On a lion budget, it's definitely not going to be generating any realtime aural cues like festival did & it's not going to make any audio books.


 All the audio is 24 kHz, 16 bit, 1 channel.  It might show promise in generating unintelligible ASMR, but yet another AI model bringup has gone down feeling like a waste of time.
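If the 24 kHz mono ever needs to drop into a video edit, a 1 line resample handles it.  A sketch with made-up filenames:

ffmpeg -i output/output.wav -ar 48000 -ac 2 asmr_48k.wav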

 There's a way to script the command with .toml files.

f5-tts_infer-cli -c asmr.toml

 Then you can make the script:

model = "F5TTS_v1_Base"
ref_audio = "relaxing1.wav"
ref_text = "Once upon a time, in the middle of winter, when the flakes of snow were falling like feathers from the sky, a queen sat at a window sewing."
gen_file = "the_text.txt"
remove_silence = false
output_dir = "output"
output_file = "output.wav"

A paragraph takes 5 minutes.   It crashes if output_dir is blank.  No output is allowed in the cwd. The laptop now dies if it's unplugged.  Its battery might have perished.  It needs multiple .toml files to hedge against crashes.
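With 1 .toml per chunk of text, a shell loop can rerun just the chunk that crashed.  A sketch, assuming hypothetical asmr1.toml, asmr2.toml, ... files, each pointing at its own gen_file & output_file:

for f in asmr*.toml; do
    f5-tts_infer-cli -c "$f" || echo "$f failed, rerun it later"
done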

 

 Results were good enough to try to bring up 1 of the lip sync models.

 ------------------------------------------------------------------------------------------------

The decision was made to try to get talking animals working with F5TTS.

The journey begins by downloading liveportrait.

git clone https://github.com/KwaiVGI/LivePortrait
Then creating another python environment 

python3 -m venv LivePortrait

source LivePortrait/bin/activate

Then install the matching pytorch for the CUDA version used in F5TTS.

pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124

They said animal poses require X-pose. 
git clone https://github.com/IDEA-Research/X-Pose.git
python3 -m venv X-Pose 

source X-Pose/bin/activate

 

Helas, X-pose & Liveportrait require pytorch 1.12 while F5TTS requires pytorch 2.4.0+cu124.  X-pose uninstalls torch 2.4 & then the models/UniPose/ops step fails on an incompatible version of CUDA.  X-pose & liveportrait have other incompatible dependencies with f5tts so they can't use the same venv.  The mane problem is pip automatically uninstalls any conflicts so you're installing each line of requirements.txt, then watching for it to start reinstalling cuda & interrupting it.
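1 way to semi-automate that is feeding requirements.txt to pip 1 line at a time with --no-deps, so pip can't swap torch or CUDA out from under you.  A sketch, not what was literally typed:

# install each requirement by itself; --no-deps stops pip from pulling a conflicting torch
grep -v '^#' requirements.txt | while read dep; do
    [ -z "$dep" ] && continue
    pip install --no-deps "$dep" || echo "FAILED: $dep"
done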


You have to run the  torch==2.4.0+cu124 command before the requirements, then 

pip install  torchvision==0.19

The only way to find the matching version was trial & error.

It didn't accept timm with CUDA 12.4.

pip install --no-deps timm

It didn't accept numpy==1.21.5 at all.

transformers==4.22.0 didn't compile but pip install transformers did.

The python -m pip install --upgrade setuptools line failed on ImpImporter. This required:
python -m ensurepip --upgrade
python -m pip install --upgrade setuptools 

Helas, X-Pose ran out of memory.  It seems to run check_gradient_numerical with increasing values for channels until it fails.  It got to 71.
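For reference, the step in question is presumably the custom CUDA op under models/UniPose/ops, built & tested with the usual deformable attention recipe (an assumption from the repo layout, not verified here):

cd models/UniPose/ops
python setup.py build install
# test.py is where check_gradient_numerical gets run with growing channel counts
python test.py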

X-pose required gcc-9 which required upgrading to Ubunt 24.

 Then it was back to installing the LivePortrait dependencies not in X-Pose, 1 line at a time, aborting any uninstalls. It also needed pip install onnxruntime.

ffmpeg, ffprobe came from a cinelerra tree.

As far as lip syncing a synthetic voice, the easiest solution seems to be manually recording yourself & using that as a driver for LivePortrait.  Lions make a 512x512 driver video.

 python inference_animals.py -s lion3.jpeg -d lion_driver.mp4 --driving_multiplier 1.75

Getting good results from inference_animals.py & inference.py requires getting the right source image & driver video.  The 2 scripts have the same limitations.  The source resolution is the output resolution but the source aspect ratio needs to be 1:1.  An aspect ratio of 4:3 made squeezed output.
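ffmpeg can square up a recorded driver clip (or a source image) before feeding it in.  A sketch, assuming the face sits in the center of the frame:

# crop to a centered square, then scale to 512x512
ffmpeg -i raw_driver.mp4 -vf "crop=ih:ih,scale=512:512" -c:a copy lion_driver.mp4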

It always chopped off a lion's right ear, regardless of flipping the source image or driver video.  

The --no_flag_stitching option fixed the ear problem but cropped it to the face, made it shaky & blacked out the top.  Stabilizing the driver video made no difference.  That option might be for manually compositing the face back in the original image.

 --no_flag_do_crop  --no_flag_stitching preserves the entire source image & shrinks the output to 512x512.  This yields an image more suitable for manually compositing into the original image.  All python inference_animals.py does by default is composite the cropped face back into the original image using a fixed garbage matte.  It doesn't do any optical flow or warping.  Sometimes it crops out important parts of the face, like an ear or a chin.

--driving_multiplier 1.75 increases the shaking.  You're better off trying to exaggerate the facial expressions manually.  All the options yield shaky output.  The shaking is hidden by the default cropping, in addition to cropping the ear & the chin.  A locked down camera reduces the shaking.  In a pinch, there's always applying stabilization to the 512x512.  We lived 30 years with 480x480.

All those days of effort finally yielded the 1st synthetic lion.  --no_flag_do_crop --no_flag_stitching with a locked down camera definitely yields the best results.  Reading an audio book would be quite difficult.  It desperately needs animation keyframes from f5tts, but everyone is just recording themselves.  It's a cut above manual keyframing, motion capture, & 3D modeling but still an investment & not as realistic.
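For the record, the combination that looked best would be invoked something like this, reusing the earlier filenames:

python inference_animals.py -s lion3.jpeg -d lion_driver.mp4 --no_flag_do_crop --no_flag_stitching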

The animal model only takes still photos while the human model takes video.  The animal model doesn't work with bald eagles.  It needs similar proportions to a human.


-------------------------------------------------------------------------------------------------------------

During this process, it became clear the lion kingdom needs 1 obsolete machine that's always available for daily use, another updated machine just for non-essential modern programs & a full KVM to avoid dual booting.  The laptop has been for the non-essential programs.

  Younger lion did invest a lot of time upgrading kernals, C libraries & compilers from source code on his Cyrix 6x86.  Somehow, it's much more daunting 30 years later.  The dependency chains have gotten a lot more complex.  The payoff is a lot less.

apt dist-upgrade was a fail

Ubunt 24 still has an option to leave the root partition alone & install over it.  This too failed while creating symbolic links on top of existing files.

The decision was made to back up the hard drive & reformat it for ubunt 24.  Managed to get gigabit over the network by changing ports, this time.  It doesn't normally go above 100baseT regardless of cable.  ethtool shows what speed it's currently using.
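e.g. (the interface name here is made up; use whatever ip link reports):

sudo ethtool enp3s0 | grep -i speed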

The matching driver 550.54.14 for CUDA 12.4 wouldn't compile on ubunt 24.  The nvidia-driver-550 package similarly failed.  Driver 470.57.02 compiled & seemed to be compatible with CUDA 12.4. 

It seems when apt fails to install some packages, it gets into a permanently broken state, like git failing a rebase.  You have to run apt purge to try to recover it.

Countreps broke & would have to be recompiled along with all the dependencies, over 2 days.  It was heavily clobbered from a port to the failed jetson nano.  Younger lion just merged the source code with the jetson but never recompiled it on the laptop.

*** stack smashing detected ***: terminated

is the new error when a non-void function falls off the end without returning a value.

It's become clear developing AI models is more math than software development.  That's why it's become standardized on python.  The next era might revolve purely around the math that creates the models which create the software.