SDF.ORG @SDF

**Yukari Hafner** @shinmera@tymoon.eu · Jan 3

I'm off to bed now, but in case anyone has thoughts about this I'd be all ears:

Any ideas on how to do very simple human voice recognition? I just want to detect whether an audio stream is likely to be a voice or not, to improve on the accuracy of the simple volume-based approach that most chat things use.

The best I've come up with is checking the largest frequency bin and whether it lies in a normal vocal range (100-8k Hz), but that seems like it'd also have lots of false positives.
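
The dominant-bin check described above could be sketched roughly as follows. This is a minimal illustration and not the poster's code: the naive DFT, the 16 kHz sample rate, the 256-sample frame, and the names `dominant_frequency`, `peak_in_vocal_range`, and `tone` are all assumptions made for the example. Only the 100 Hz to 8 kHz range comes from the post.

```python
import cmath
import math

SAMPLE_RATE = 16000  # assumed; must be at least 16 kHz to represent 8 kHz
FRAME_SIZE = 256     # assumed; gives 62.5 Hz per bin at 16 kHz

def dominant_frequency(frame, sample_rate):
    """Return the frequency in Hz of the strongest non-DC bin of a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    # For a real-valued signal only bins 1 .. n/2 carry distinct information.
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

def peak_in_vocal_range(frame, sample_rate, low_hz=100.0, high_hz=8000.0):
    """True if the strongest bin lies inside the assumed vocal range."""
    return low_hz <= dominant_frequency(frame, sample_rate) <= high_hz

def tone(freq_hz):
    """Hypothetical test helper: one frame of a pure sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(FRAME_SIZE)]

voiced_like = tone(250.0)  # inside the range, accepted
whistle = tone(2000.0)     # not a voice, but also inside the range
low_hum = tone(62.5)       # below the range, rejected

results = [peak_in_vocal_range(f, SAMPLE_RATE)
           for f in (voiced_like, whistle, low_hum)]
```

The 2 kHz tone passes the check even though it is not a voice, which is the false-positive concern raised in the post: the range is wide enough that most everyday sounds have their peak somewhere inside it.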

screwlisp @screwtape@mastodon.sdf.org

@shinmera A matched filter approach, based on fragments of the person you expect to hear talking?

Jan 03, 2025, 23:30

**screwlisp** @screwtape · Jan 3

@shinmera https://en.wikipedia.org/wiki/Matched_filter

**Yukari Hafner** @shinmera@tymoon.eu · Jan 3

@screwtape Hmm, yeah, I thought about similar stuff, but I really don't want to train on a specific voice or anything. I guess convolving with an inverse average voice frequency response and then checking deviation could work?

**screwlisp** @screwtape · Jan 3

@shinmera I guess you could base it on a range of people instead of one person, and it would work better on average and worse in any particular case. This is a normal receiver operating characteristic scenario, isn't it? There will be a lot of implementations of this sitting around, I think. (produces none).

**Yukari Hafner** @shinmera@tymoon.eu · Jan 3

@screwtape I have no idea, and in general signal processing theory stuff is extremely incompatible with my brain, so....

Anyway, I just want to do something a bit smarter than the usual voice chat thing of thresholding by volume, with the hopes I'll prevent it from being triggered by random noises.

Since I want to use it to drive the avatar's mouth open/close, having minor noise be treated as signal is far worse than in a voice chat app situation.

**screwlisp** @screwtape · Jan 4

@shinmera oh, I'm aware people have done what you're saying in particular, but I've never watched such a thing. Basically you do a discrete convolution/correlation of two arrays, the test sample, and the kernel. We expect that bins in the result exceeding some sensitivity number you choose by trial and error are detections of the kernel in the test sample. You judge your quality by the True Positive Fraction and False Positive Fraction for your chosen sensitivity.
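
The correlate, threshold, and score procedure described above might look something like this. It is a toy sketch under stated assumptions: the kernel and test sample are made-up numbers rather than audio, and the function names `correlate`, `detect`, and `rates` are hypothetical. The sliding dot product is the plain time-domain form of the discrete correlation mentioned in the post.

```python
def correlate(signal, kernel):
    """Sliding dot product of the kernel against the signal (full overlaps only)."""
    m = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(m))
            for i in range(len(signal) - m + 1)]

def detect(signal, kernel, threshold):
    """Offsets whose correlation score exceeds the chosen sensitivity."""
    return [i for i, score in enumerate(correlate(signal, kernel))
            if score > threshold]

def rates(detected, truth, n_positions):
    """Return (true positive fraction, false positive fraction) over all offsets."""
    detected, truth = set(detected), set(truth)
    true_pos = len(detected & truth)
    false_pos = len(detected - truth)
    negatives = n_positions - len(truth)
    tpf = true_pos / len(truth) if truth else 0.0
    fpf = false_pos / negatives if negatives else 0.0
    return tpf, fpf

# Made-up kernel, embedded twice in an otherwise silent test sample.
kernel = [1.0, 2.0, -1.0, -2.0]
truth = [3, 10]
signal = [0.0] * 16
for start in truth:
    for j, value in enumerate(kernel):
        signal[start + j] += value

n_positions = len(signal) - len(kernel) + 1

strict = detect(signal, kernel, threshold=5.0)  # finds offsets 3 and 10 only
loose = detect(signal, kernel, threshold=1.0)   # also fires on partial overlaps

strict_rates = rates(strict, truth, n_positions)
loose_rates = rates(loose, truth, n_positions)
```

With the stricter threshold both embedded copies are found and nothing else fires. The looser threshold still finds both but also triggers on the partially overlapping neighbours, so the false positive fraction rises. Sweeping the threshold and plotting the two fractions against each other gives the receiver operating characteristic curve mentioned earlier in the thread.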

**screwlisp** @screwtape · Jan 4

@shinmera I'll do a demo of a matched filter for the show on wednesday, since I was building fourier transform pipeline demos right now anyway.

**screwlisp** @screwtape · Jan 4

@shinmera I would run a bank of filters and max the results. That's my feel for "any of multiple things are happening".
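
One way to read "a bank of filters, max the results" is sketched below. Several details are assumptions added for the example rather than anything stated in the thread: the kernels all share one length so their outputs line up, each kernel is scaled to unit energy so the scores are comparable, and the names `unit_energy` and `bank_response` are hypothetical.

```python
import math

def correlate(signal, kernel):
    """Sliding dot product of the kernel against the signal (full overlaps only)."""
    m = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(m))
            for i in range(len(signal) - m + 1)]

def unit_energy(kernel):
    """Scale a kernel so its squared samples sum to 1 (assumed normalization)."""
    norm = math.sqrt(sum(k * k for k in kernel))
    return [k / norm for k in kernel]

def bank_response(signal, kernels):
    """Per-offset maximum over the correlation outputs of every kernel."""
    outputs = [correlate(signal, unit_energy(k)) for k in kernels]
    return [max(scores) for scores in zip(*outputs)]

# Two made-up kernels standing in for fragments from different speakers.
kernel_a = [1.0, 1.0, -1.0, -1.0]
kernel_b = [1.0, -1.0, 1.0, -1.0]

# Test sample containing kernel_a at offset 1 and kernel_b at offset 7.
signal = [0.0, 1.0, 1.0, -1.0, -1.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0, 0.0]

response = bank_response(signal, [kernel_a, kernel_b])
hits = [i for i, score in enumerate(response) if score > 1.5]
```

Taking the maximum means a single threshold fires whenever any one kernel matches well. The trade-off is that every added kernel is another chance for a spurious match, so the false positive fraction tends to grow with the size of the bank.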
