lip sync options?

zen · #1 Post by **zen** » Sat Dec 03, 2005 8:44 pm

I am playing with lip syncing faces to .wav files. It works well enough using the anim.SMAnimation function and individually constructed faces, one for each phoneme. However, it would be much less work if the live composite function supported an animation built on the anim.SMAnimation function. Perhaps it does, but so far I can't figure out the syntax. If it does, an example of the syntax would be most helpful. Below is a possible example of a blinking mouth on a face.

image grin livecomposite = LiveComposite((200, 300),
    (0, 0), anim.Blink(Image("mouth.png")),
    (0, 50), "face.png")

How would this blinking animation be changed to use the animation created below instead?

image girl mouth = anim.SMAnimation("a",
anim.State("a", "u.png"),
anim.Edge("a", .5, "a" ),
anim.State("b", "wq.png"),
anim.Edge("a", .5, "b"),
anim.Edge("b", .5, "a"),
)

If this approach is not possible, can you recommend another approach? I don't "need" this yet in a game, I'm just exploring the engine. I could see however that if this worked, the SMAnimation function could someday be modified to read in or otherwise leverage .txt or .jls files that provide the info for frame/edge, delay, and phoneme.

Once again, thanks for the fun.

#2 Post by **PyTom** » Sat Dec 03, 2005 9:50 pm

Code:

image grin livecomposite = LiveComposite((200, 300),
                                             (0, 0),  anim.SMAnimation("a",
        anim.State("a", "u.png"),
        anim.Edge("a", .5, "a" ),  
        anim.State("b", "wq.png"),
        anim.Edge("a", .5, "b"),
        anim.Edge("b", .5, "a"),
        ),   

                                             (0, 50), "face.png"))

I'm too lazy to go in and reindent that, but it should work.

If you can think of a better way of specifying the images, then I can do that as well... but I don't know what the way would be.

I think it might also make sense to include some way of stopping the animation when the voice stops. This way, it's not necessary to manually change the image.

It would probably be best to encapsulate this in a function, rather than requiring one to declare a new image for many lines of dialogue. This could take a lip-sync file as one of the inputs... can you describe the file format?

Please, work with me before doing things the hard way... There is probably an easy way of doing it.

zen · #3 Post by **zen** » Sun Dec 04, 2005 12:29 am

That worked perfectly. In case a future reader copies that code: just remove the final parenthesis to fix a syntax error.

In order to lipsync properly, we need to create a graphic image for each of a few phonemes, and one for closed. Then we need to know when to change the image as a wav file plays. There are several good software packages that make this fairly easy. Magpie is the best known, and a very similar open source "clone" is Jlipsync. Both allow you to input a wav file and find the start and end of each phoneme. Both also allow you to copy to the clipboard a listing that shows these breaks and the delays between them. This is exactly what is needed to build the animation as described above.

Magpie, which is shareware, outputs the spoken phrase "Hasta La Vista" (as in Arnold saying "hasta la vista, baby") in this format:

1 00:00:00.00
2 00:00:00.01
3 00:00:00.02 Closed
4 00:00:00.03 A
5 00:00:00.04 A
6 00:00:00.05 A
7 00:00:00.06 A
8 00:00:00.07 S
9 00:00:00.08 S
10 00:00:00.09 S
11 00:00:00.10 S
12 00:00:00.11 T
13 00:00:00.12 T
14 00:00:00.13 A
15 00:00:00.14 A
16 00:00:00.15 L
17 00:00:00.16 L
18 00:00:00.17 L
19 00:00:00.18 A
20 00:00:00.19 A
21 00:00:00.20 V
22 00:00:00.21 V
23 00:00:00.22 E
24 00:00:00.23 E
25 00:00:00.24 E
26 00:00:00.25 S
27 00:00:00.26 S
28 00:00:00.27 S
29 00:00:00.28 S
30 00:00:00.29 T
31 00:00:01.00 T
32 00:00:01.01 T
33 00:00:01.02 A
34 00:00:01.03 A
35 00:00:01.04 A
36 00:00:01.05 Closed

The first column is the frame number, the second is the "time", and the third is the phoneme label. This example was coded at 30 frames/sec., so a little math is needed to calculate how much time elapses between any given number of frames. The format does not state the frames per second, but you can figure that out easily enough by parsing the 2nd column to find the frame number at which the timecode reaches 01.00, then subtracting 1.
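The frame-rate arithmetic described above can be sketched in Python. This is only a minimal sketch, assuming the export looks exactly like the Magpie sample (frame number, `HH:MM:SS.FF` timecode, optional mouth name); `parse_magpie` and `infer_fps` are hypothetical helper names, not part of Magpie or Ren'Py.

```python
def parse_magpie(text):
    """Parse 'frame timecode [mouth]' lines into (frame, timecode, mouth) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not a frame row
        mouth = parts[2] if len(parts) > 2 else None
        rows.append((int(parts[0]), parts[1], mouth))
    return rows


def infer_fps(rows):
    """Return the frame number at which the timecode reaches 00:00:01.00, minus 1.

    Returns None if the clip is shorter than one second.
    """
    for frame, timecode, _mouth in rows:
        hours, minutes, rest = timecode.split(":")
        seconds, sub_frame = rest.split(".")
        if (int(hours), int(minutes), int(seconds), int(sub_frame)) == (0, 0, 1, 0):
            return frame - 1
    return None
```

With the frame rate known, the time covered by `n` frames is simply `n / fps` seconds.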

Jlipsync uses a very similar file layout:

Frame Timecode Key Mouth name Comments
1 00:00:00.01
2 00:00:00.02
3 00:00:00.03 Closed
4 00:00:00.04 A
5 00:00:00.05
6 00:00:00.06
7 00:00:00.07
8 00:00:00.08
9 00:00:00.09 S
10 00:00:00.10
11 00:00:00.11
12 00:00:00.12
13 00:00:00.13 T
14 00:00:00.14
15 00:00:00.15 A
16 00:00:00.16
17 00:00:00.17
18 00:00:00.18 L
19 00:00:00.19
20 00:00:00.20 A
21 00:00:00.21
22 00:00:00.22
23 00:00:00.23 V
24 00:00:00.24
25 00:00:00.25 E
26 00:00:00.26
27 00:00:00.27 S
28 00:00:00.28
29 00:00:00.29
30 00:00:00.30
31 00:00:01.31
32 00:00:01.32 T
33 00:00:01.33
34 00:00:01.34
35 00:00:01.35 A
36 00:00:01.36
37 00:00:01.37
38 00:00:01.38 Closed
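Reading the Jlipsync layout could look something like the sketch below. It assumes, based only on the sample above, that a row without a mouth name means the previous mouth is held, and that mouth names are single words; `expand_keyframes` is a hypothetical name, not a Jlipsync or Ren'Py function.

```python
def expand_keyframes(text):
    """Return (frame, mouth) pairs, filling held frames with the last key mouth."""
    frames = []
    current = None  # no mouth is known until the first key appears
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skips the 'Frame Timecode ...' header and blank lines
        if len(parts) > 2:
            current = parts[2]  # a new key mouth starts on this frame
        frames.append((int(parts[0]), current))
    return frames
```

Once expanded this way, the Jlipsync data has the same per-frame shape as the Magpie listing, so both formats could feed the same downstream code.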

I would recommend that, if you do decide to add lipsync functionality to Ren'Py someday, you support the Jlipsync format, as it is open source and runs on every platform (it's a Java app).

Basically, there are a minimum of 9 phonemes and a rest or closed state needed to do a reasonable lipsync. They are:

1. A, I
2. O
3. E (as in sweet)
4. U
5. C, K, G, J, R, S, TH, Y, Z
6. D, L, N, T
7. W, Q
8. M, B, P
9. F, V
10. Rest state or closed

For more accurate lipsyncing you just break out a few, like TH, and show that tongue!
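The grouping in the list above can be written down as a lookup table. This is a rough sketch: the group keys (`"cs"`, `"dl"`, and so on) are made-up names for illustration (only `u` and `wq` echo image names used earlier in the thread), and falling back to the closed mouth for unknown sounds is an assumption.

```python
# One entry per mouth shape from the list; keys are hypothetical image stems.
MOUTH_GROUPS = {
    "a": ["A", "I"],
    "o": ["O"],
    "e": ["E"],
    "u": ["U"],
    "cs": ["C", "K", "G", "J", "R", "S", "TH", "Y", "Z"],
    "dl": ["D", "L", "N", "T"],
    "wq": ["W", "Q"],
    "mbp": ["M", "B", "P"],
    "fv": ["F", "V"],
    "closed": ["CLOSED", "REST"],
}

# Invert the table so each sound label maps to its mouth shape.
SOUND_TO_MOUTH = {
    sound: mouth for mouth, sounds in MOUTH_GROUPS.items() for sound in sounds
}


def mouth_for(sound):
    """Return the mouth shape for a sound label, or 'closed' if it is unknown."""
    return SOUND_TO_MOUTH.get(sound.upper(), "closed")
```

Breaking out TH for more accuracy would just mean moving it from the `"cs"` entry into a group of its own.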

This is probably enough info to describe the process. I imagine a function would need to know what the existing animation function does, but it would get its edge info from the lip-sync file: the image (the phoneme image), the delay (calculated from the number of frames between phonemes, adjusted for the frame rate), and the x,y position. The graphics could be reused for every character, but it would be nice to pass the character name into the function and prefix it to the phoneme image name. So the "U" sound would generically be u.png (or .gif, but .png is cleaner), and a character named Arnold would look for arnold_u.png first, otherwise fall back to u.png.
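A rough Python sketch of those two pieces, the delay calculation and the character-specific filename fallback, might look like this. The names `to_runs` and `mouth_image` are hypothetical, and lower-casing the filenames is an assumption; nothing here is an existing Ren'Py function.

```python
import os


def to_runs(mouths, fps):
    """Collapse a per-frame list of mouth names into (mouth, seconds) runs."""
    runs = []
    for mouth in mouths:
        if runs and runs[-1][0] == mouth:
            runs[-1][1] += 1  # same mouth as the previous frame: extend the run
        else:
            runs.append([mouth, 1])  # a new mouth starts here
    return [(mouth, count / fps) for mouth, count in runs]


def mouth_image(character, mouth, exists=os.path.exists):
    """Prefer '<character>_<mouth>.png'; fall back to the generic '<mouth>.png'."""
    specific = "{}_{}.png".format(character, mouth).lower()
    generic = "{}.png".format(mouth).lower()
    return specific if exists(specific) else generic
```

Each `(mouth, seconds)` run corresponds to one state plus the delay on its outgoing edge in the kind of animation shown at the top of the thread.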

Ok, this message is long enough. Thanks for taking the time to read it. I think this function might be an interesting one, but unless there is demand for it, don't knock yourself out.

#4 Post by **PyTom** » Sun Dec 04, 2005 12:25 pm

Two things.

First of all, I would have no problem adding in an anim.LipSync object, which is a widget that would lipsync along with a voice. This would be used in conjunction with, say, a lipvoice function, which would be responsible for playing sound and setting the lipsync file that the lips would, well, sync to.

So it would look something like this:

Code:

init: 
    image shana = LiveComposite((300, 480), 
        (0, 0), "shana_base.png",
        (125, 100), anim.LipSync('shana', mouths=dict(
            a = "shana_mouth_a.png",
            b = "shana_mouth_b.png",
            ...
            closed = "shana_mouth_closed.png")))

label start:
    
     scene ...
     show shana
     $ lipvoice('shana_1.ogg', 'shana', 'shana_1.jlis', fps=30)
     s "Flame haze... Mistess... I'm the one with the sword!"

The idea is that one shows an anim.LipSync named shana, and then uses lipvoice to both play a voice file, and at that same time to tell that lip sync which file to use.

In order to code this up, I will need some voice files, the corresponding mouth files, and some lipsync files. If I have all of that, then I'd be willing to code it up. I'd also ask that you have a mostly done game (say, with a complete script), which is usually my criterion for adding new features to Ren'Py.

Now, that being said, I'm sort of questioning if it's really necessary to have that level of lip synchronization. In the games I've played that have lip sync to voice, there seem to be as little as three or so lip positions per character, and it's not clear to me that they are actually synchronized to the voice, as compared to simply being randomly animated when the voice is playing. (Perhaps they are and I just didn't notice it.)

I think my point is that you should think fairly hard about whether you really want to go through all this trouble. Actual American-animation-style synchronization seems like a lot of work for something that probably few people will notice... perhaps your efforts could be better spent elsewhere.

As a game-maker, it's your call, and I'd have no problem implementing the supporting code. But at the same time, I'd eventually want to play your game.

mikey · #5 Post by **mikey** » Sun Dec 04, 2005 6:36 pm

Hmmm, you can't deny it's a great idea, really. But at the same time it has that gimmick-potential, meaning if the game doesn't have the power to carry this, it will easily feel like golden handlebars on a dirtbike. (Excuse the metaphor, I was actually looking for the "killing the fly with a cannonball" one.)

So I'm really rooting for you to make this happen in a good way, but if it gets too complicated, I think a simple solution would be to make the character randomly open and close their lips for the duration of text/speech.

It's kind of a cheap effect, but it's one that is very noticeable while not being overly complicated (or so I think).
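For what it's worth, the random open/close idea can be sketched in a few lines of Python. The hold times (0.08 to 0.25 seconds) are arbitrary guesses and `random_flaps` is a hypothetical name; this only builds the timing schedule, it does not display anything.

```python
import random


def random_flaps(duration, min_hold=0.08, max_hold=0.25, seed=None):
    """Return alternating ('open'/'closed', seconds) pairs covering `duration`."""
    rng = random.Random(seed)
    schedule = []
    remaining = duration
    mouth_open = True
    while remaining > 0:
        # Pick a random hold, but never run past the end of the line.
        hold = min(rng.uniform(min_hold, max_hold), remaining)
        schedule.append(("open" if mouth_open else "closed", hold))
        remaining -= hold
        mouth_open = not mouth_open
    schedule.append(("closed", 0.0))  # always finish on the closed mouth
    return schedule
```

The `duration` would be the length of the voice clip, or an estimate based on text length when there is no voice.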

But hey, I am known for being stubborn in many areas of my game design, so if you think differently, just go ahead and do it, I'm with that as well ^_^

Megaman Z · #6 Post by **Megaman Z** » Tue Dec 06, 2005 10:15 pm

mikey wrote:...but if it gets too complicated, I think a simple solution would be to make the character randomly open and close their lips for the duration of text/speech.

ugh... that brings back some weird and pitiful dubbing memories

*FLASHBACK TO SECOND-TO-LAST FMV OF PS2 GAME LIFELINE*

yeah... worst dubbing ever. I could've easily found better words to fit the mouthflapping. since when does it take FOUR mouth-openings to say "to dust?"? (I know, probably grammatically incorrect, but you get the point, right?)

monele · #7 Post by **monele** » Wed Dec 07, 2005 4:44 am

Just another way to do it that I witnessed in Tokimeki Memorial 1 & 2 : have a number of frames from closed mouth to fully opened mouth and set a min/max volume fitting those two extreme states. Then, as the voice plays, have the frames be displayed according to the volume of the sound. It's amazingly simple, yet the results are really good as long as you don't pay too much attention to how the mouth should look because of sounds (e, a, u, ...) and if you have *expressive* voice actors. Doesn't mean they have to yell but a very monotonous speech would be rather bad.
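The volume-driven approach boils down to a clamped linear mapping from loudness to a frame index. Here is a minimal sketch, assuming frame 0 is the closed mouth and the last frame is fully open; `frame_for_volume` is a hypothetical name, and how the current volume is measured is left out entirely.

```python
def frame_for_volume(volume, min_volume, max_volume, frame_count):
    """Map a volume reading to a mouth frame index (0 = closed, last = fully open)."""
    if max_volume <= min_volume or frame_count < 1:
        raise ValueError("need max_volume > min_volume and at least one frame")
    # Normalise to 0..1 and clamp readings outside the chosen min/max range.
    level = (volume - min_volume) / (max_volume - min_volume)
    level = max(0.0, min(1.0, level))
    # Split the range into equal-width buckets, one per frame.
    return min(frame_count - 1, int(level * frame_count))
```

Choosing `min_volume` and `max_volume` per voice actor would be the main tuning knob: a quiet, monotonous recording needs a narrower range to get any mouth movement at all.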

Lemma Soft Forums
