Channel: Speech Recognition With Vosk

Optimization in SphinxTrain

I spend quite a significant amount of time training various models. It feels like alchemy: you add this, tune that, and you get nice results. And while training you can read Twitter ;) I've also spent 10 years in a group which builds optimizing compilers, so in theory I should know a lot about them. I rarely apply that knowledge in practice though. But after being bored by several weeks of training, you can apply some of it here.



So the algorithm:

1) Train a model for a month and become bored
2) Get the idea that SphinxTrain is compiled without optimization
3) Go to SphinxTrain/config and change the compilation option from -O2 to -O3
4) Measure the run time of a simple bw run with the time command
5) See that the time doesn't really change
6) Add the -pg option to CFLAGS and LDFLAGS to collect a profile
7) See that most of the time is spent in the log_diag_eval function, which is a simple weighted dot product computation
8) Look at the assembler code of log_diag_eval:

0x42c3b0 log_diag_eval: unpcklps %xmm0,%xmm0
0x42c3b3 log_diag_eval+3: test %ecx,%ecx
0x42c3b5 log_diag_eval+5: cvtps2pd %xmm0,%xmm0
0x42c3b8 log_diag_eval+8: je 0x42c3fd log_diag_eval+77
0x42c3ba log_diag_eval+10: sub $0x1,%ecx
0x42c3bd log_diag_eval+13: xor %eax,%eax
0x42c3bf log_diag_eval+15: lea 0x4(,%rcx,4),%rcx
0x42c3c7 log_diag_eval+23: nopw 0x0(%rax,%rax,1)
0x42c3d0 log_diag_eval+32: movss (%rdi,%rax,1),%xmm1
0x42c3d5 log_diag_eval+37: subss (%rsi,%rax,1),%xmm1
0x42c3da log_diag_eval+42: unpcklps %xmm1,%xmm1
0x42c3dd log_diag_eval+45: cvtps2pd %xmm1,%xmm2
0x42c3e0 log_diag_eval+48: movss (%rdx,%rax,1),%xmm1
0x42c3e5 log_diag_eval+53: add $0x4,%rax
0x42c3e9 log_diag_eval+57: cmp %rcx,%rax
0x42c3ec log_diag_eval+60: cvtps2pd %xmm1,%xmm1
0x42c3ef log_diag_eval+63: mulsd %xmm2,%xmm1
0x42c3f3 log_diag_eval+67: mulsd %xmm2,%xmm1
0x42c3f7 log_diag_eval+71: subsd %xmm1,%xmm0
0x42c3fb log_diag_eval+75: jne 0x42c3d0 log_diag_eval+32
0x42c3fd log_diag_eval+77: repz retq

9) Understand that the generated code is not as good as it could be

10) Run

gcc -DPACKAGE_NAME=\"SphinxTrain\" -DPACKAGE_TARNAME=\"sphinxtrain\" \
-DPACKAGE_VERSION=\"1.0.99\" -DPACKAGE_STRING=\"SphinxTrain\ 1.0.99\" \
-DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 \
-DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 \
-DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_LIBM=1 \
-I/home/nshmyrev/SphinxTrain/../sphinxbase/include \
-I/home/nshmyrev/SphinxTrain/../sphinxbase/include -I../../../include -O3 \
-g -Wall -fPIC -DPIC -c gauden.c -o obj.x86_64-unknown-linux-gnu/gauden.o \
-ftree-vectorizer-verbose=2

to see that log_diag_eval loop isn't vectorized

11) Add -ffast-math and see it doesn't help

12) Rewrite function from

float64
log_diag_eval(vector_t obs,
              float32 norm,
              vector_t mean,
              vector_t var_fact,
              uint32 veclen)
{
    float64 d, diff;
    uint32 l;

    d = norm; /* log (1 / 2 pi |sigma^2|) */

    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        d -= var_fact[l] * diff * diff; /* compute -1 / (2 sigma ^2) * (x - m) ^ 2 terms */
    }

    return d;
}

to

float64
log_diag_eval(vector_t obs,
              float32 norm,
              vector_t mean,
              vector_t var_fact,
              uint32 veclen)
{
    float64 d, diff;
    uint32 l;

    d = 0.0;

    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        d += var_fact[l] * diff * diff; /* compute 1 / (2 sigma ^2) * (x - m) ^ 2 terms */
    }

    return norm - d; /* log (1 / 2 pi |sigma^2|) */
}

to turn the subtraction, which hurts vectorization, into accumulation.

13) See that the loop is now vectorized. Enjoy the speed!!!

The key thing to understand here is that programming is rather flexible and compilers are rather dumb, but you have to cooperate: use very simple constructs to let the compiler do its work. Moreover, this idea of using simple constructs has other benefits, since it helps to keep the code style clean and enables automated static analysis with tools like splint.
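
For instance, a plain loop over restrict-qualified arrays like the toy example below is easy for GCC's vectorizer, while the same computation hidden behind pointer tricks or a mixed float/double accumulator often is not (just an illustration, not SphinxTrain code):

/* Toy example; try it with: gcc -O3 -ftree-vectorizer-verbose=2 -c simple.c */
void
scale_add(float *restrict out, const float *restrict a,
          const float *restrict b, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++)
        out[i] = a[i] + 0.5f * b[i];
}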

Maybe the same applies to speech recognition. We need to help computers in their efforts to understand us. Speak slowly and articulate clearly, and both we and the computers will enjoy the result.

If you are interested in loop vectorization in GCC, see http://gcc.gnu.org/projects/tree-ssa/vectorization.html

Some more optimization

In addition to the previous post, here are two more tricks for log_diag_eval.

Floats instead of doubles

If the accumulator is a float, SSE can be used more effectively.

Hardcode vector length

The most common optimization is loop unrolling. It helps to optimize memory access and eliminates jump instructions. The issue here is that the number of iterations in log_diag_eval can differ at various stages. GCC has an interesting profile-based optimization for this case, see the -fprofile-generate option: it runs the program and then derives a few specific optimizations from the runtime profile. The good point is that we can be almost sure about the usage pattern of our target loop, so we can optimize without profiling. So, turn


for (i = 0; i < veclen; i++) {
   do work
}


to

if (veclen == 40) { // Commonly used value, 40 floats in each frame
    for (i = 0; i < 40; i++) {
        do work // This loop will be unrolled
    }
} else {
    for (i = 0; i < veclen; i++) {
        do work
    }
}



GCC does the same trick with the profiler, but since our feature frame size is fixed, we can hardcode it. As a result GCC will unroll the first loop and it will be fast as the wind.
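
Putting both tricks together, a sketch of the function could look like this (an illustration of the idea, not the exact SphinxTrain code):

float64
log_diag_eval(vector_t obs,
              float32 norm,
              vector_t mean,
              vector_t var_fact,
              uint32 veclen)
{
    float32 d, diff;   /* float accumulator maps better to packed SSE */
    uint32 l;

    d = 0.0f;

    if (veclen == 40) {                 /* common frame size: known trip count */
        for (l = 0; l < 40; l++) {
            diff = obs[l] - mean[l];
            d += var_fact[l] * diff * diff;
        }
    } else {                            /* generic fallback for other lengths */
        for (l = 0; l < veclen; l++) {
            diff = obs[l] - mean[l];
            d += var_fact[l] * diff * diff;
        }
    }

    return norm - d;
}

The first branch has a compile-time trip count, so GCC can unroll and vectorize it aggressively; the second branch keeps correctness for unusual vector lengths.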

Looking on the waves

Here is the question: a perfectly clean-looking sound file which is transcribed with 10% accuracy. Sounds crazy, doesn't it? No noise, no accent.



Because of that I'm looking at the state of the art in channel normalization, especially for non-linear channel distortions. No good solution yet; I've only found a description of the problem in a very old paper.



There is CDCN normalization, a few CMN improvements, RASTA and even the recently invented HN normalization. CDCN is surprisingly available in SphinxTrain but nobody uses it. Well, it gives no improvement, but it's an interesting approach worth documenting one day. The idea of collecting statistics from the speech and applying them later sounds nice.

There are model-level approaches, various feature transforms, adaptations. They do not really look that attractive. Most papers now deal with channel compensation for speaker recognition, not speech recognition. I must admit the topic is too large to survey in a few weeks.

Luckily, I can also spend time looking at waves. Somewhat more pleasant, I would say.

Openfst troubleshooting

A bit of openfst troubleshooting for when you try to build a WFST with Juicer. Say you are running


fstcompose ${OUTLEXBFSM} ${OUTGRAMBFSM} | \
fstepsnormalize | \
fstdeterminize | \
fstencode --encode_labels - $CODEX | \
fstminimize - | \
fstencode --decode - $CODEX | \
fstpush --push_weights | \
fstarcsort

and get this


FATAL: StringWeight::Plus: unequal arguments (non-functional FST?)


Huh? Which arguments are not equal? What caused this? How do I fix it? It should definitely be more self-explanatory. That's basically quite a common issue: you get just a short message that nobody, including the author, can understand. Go figure out how to fix it.





In this particular case you go to the openfst sources and change the following line:


  if (w1 != w2)
    LOG(FATAL) << "StringWeight::Plus: unequal arguments "
               << "(non-functional FST?) "<< w1 << ""<< w2;

Wait another half an hour for it to compile (who decided to build it with pure templates!) and see that at least it outputs the arguments now. You run it again and get

FATAL: StringWeight::Plus: unequal arguments (non-functional FST?) 833_9 832_9

Heh, also not very descriptive, but at least some hint. Looking at the output states 833 and 832 you see
that they have identical pronunciations. That's it. Your dictionary shouldn't have identical pronunciations. Moreover, it shouldn't have identically pronounced trigrams. Things pronounced like "a b cd" vs "ab c d" make the WFST non-deterministic. Why didn't it warn about the issue when it converted the dictionary? Who knows. Anyway, now you can read about lexgen and find the option to fight identical pronunciations:

  -outputAuxPhones           -> indicates that auxiliary phones should be added to pronunciations in the lexicon in order to disambiguate distinct words with identical pronunciations

This option should make things better.
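
For illustration, auxiliary phones are just extra symbols appended to otherwise identical pronunciations in the lexicon, something along these lines (a made-up fragment; the exact format of the lexgen output may differ):

read  R EH D #1
red   R EH D #2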

I must admit CMUSphinx is also full of this: bad error messages which neither describe the problem nor hint at the solution. Compare to the output of recent Maven:

[ERROR] No goals have been specified for this build. You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>. Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-clean, clean, post-clean, pre-site, site, post-site, site-deploy. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/NoGoalSpecifiedException


Maybe it's too verbose, but I think it's the right way to do it. So if you see something that is not clear in CMUSphinx, please report it. We'll happily fix it.

Coming up next - what to do when openfst hangs or takes all your memory.

Word position context dependency of Sphinxtrain and WFST

An interesting thing about SphinxTrain models is that they use word position as context when looking up the senone for a particular triphone. That means that in theory a senone for word-initial phones could be different from senones for word-internal phones and senones for word-final phones. It's actually sometimes the case:

ZH  UW  ER b    n/a   48   4141   4143   4146 N
ZH UW ER e n/a 48 4141 4143 4146 N
ZH UW ER i n/a 48 4141 4143 4146 N

but

AA  AE   F b    n/a    9    156    184    221 N
AA AE F s n/a 9 149 184 221 N

Here, in the WSJ model definition from sphinx4, the symbol in the fourth column means "beginning", "end", "internal" or "single", and the other numbers are transition matrix ids and senone ids.

However, if you want to build a WFST cascade from the model, it's an issue how to embed the word position into the context-dependent part of the cascade. My solution was to ignore position. You can ignore position in an already built model, since the differences caused by word position are small, but to do it consistently it's better to retrain a word-position-independent model.

As of today you can do this easily: the mk_mdef_gen tool supports the -ignorewpos option which you can set in the scripts. Basically everything is counted as an internal triphone. My tests show that this model is not worse than the original one, at least for conversational speech. Enjoy.

P.S. Want to learn more about WFST - read Paul Dixon's blog http://edobashira.com and Josef Novak's blog http://probablekettle.wordpress.com

Fillers in WFST

Another practical question is: how do you integrate fillers? There is a silence class introduced in

A GENERALIZED CONSTRUCTION OF INTEGRATED SPEECH RECOGNITION TRANSDUCERS by Cyril Allauzen, Mehryar Mohri, Michael Riley and Brian Roark

and implemented in transducersaurus.

But you know each practical model has more than just silence. Fillers like noise, silence, breath and laughter all go to specific senones in the model. I usually try to minimize them during training, for example joining all those ums, hmms and mhms into a single phone, but I still think they are needed. How do you integrate them when you build a WFST recognizer?

So I tried a few approaches. For example, instead of adding just a <sil> class in the T transducer I tried to create separate branches for each filler. As a result the final cascade expands into a huge monster: if the cascade was 50mb, after combination with one silence class it is 100mb, but after 3-4 classes it's 300mb. Not a nice thing to do.

So I ended up with dynamic expansion of silence transitions, like this:

if edge is silence:
    for filler in fillers:
        from_node.add_edge(filler)

This seems to work well.

CMUSphinx accepted at Google Summer Of Code 2011

So we are in. Great to know that.

For more information see

http://cmusphinx.sourceforge.net/2011/03/cmusphinx-at-gsoc-2011/

I think it's a big responsibility and a big opportunity as well. Of course we don't consider this a way to improve CMUSphinx itself or something that will allow us to get features coded for free. Instead, we are looking for new people to join CMUSphinx and become part of it. Maybe it's a great opportunity for Nexiwave as well.

For now the task is to prepare the list of ideas for the projects. I know they need to be drafted carefully. If you want to help, please jump in. I definitely need some help.

Voicemail transcription with Pocketsphinx and Asterisk (Part 2)

This is the second part describing voicemail transcription for Asterisk administrators. See the previous part, which describes how to set up Pocketsphinx, here.

So you have configured the recognizer to transcribe voicemails and now want to improve recognition accuracy. Honestly, you will not get perfect transcription results for free unless you send the voicemails to some human-assisted transcription company. You will not get them from Google either, though there are several commercial services to try, like Yap or Phonetag, which specialize in voicemail. Our proprietary Nexiwave technology, for example, uses way more advanced algorithms and way bigger speech databases than those distributed with Pocketsphinx, and it's a really visible difference.

However, even the result you can get with Pocketsphinx can be very usable for you. I estimate you can easily get 80-90% accuracy with little effort, provided the language of your voicemails is simple.


Now, the core components of the recognizer are:

  • Language model, which controls the sequence of words
  • Acoustic model, which describes how each phone sounds
  • Phonetic dictionary, which maps words to their phonetic representation

To get better accuracy you need to improve all three. By default the following models are used:

  • Dictionary - pocketsphinx/model/lm/en_US/cmu07a.dic
  • Language model - pocketsphinx/model/lm/en_US/hub4.5000.DMP
  • Acoustic model - pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k

So let's try to improve them step by step, in order of importance.

Language model
The core reason voicemail transcription is bad is that the language model is built for a completely different domain. HUB4 is a DARPA task for transcribing broadcast news, so you see it's very different from voicemail language. It's perfect for recognizing a voicemail about NATO or democracy, but not about your wife's problems. We need to change the language model.

1) Transcribe some amount of your existing voicemails. A hundred will already be good. Put the transcriptions in a text file line by line:

hello jim it's steve let's meet at five p m
hello jim buy some milk
jim it's bob i should catch you tomorrow
you fired jim
....

The important thing here is that the text is all lowercase, one sentence per line, and doesn't have any punctuation.

2) Then, find some domain-specific texts on your computer. For example, if you are working as a system administrator in a chemical company, some chemical texts will help to improve the quality of the language model. Take a few books and convert them to the same simple text form: strip out punctuation and formatting and add them to the text of the transcribed voicemails (see the small filter sketch below). Consider your email archives too, they can also be good.
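
If you want to automate the cleanup, a trivial character filter along these lines does the job (a toy sketch, not a CMUSphinx tool; numbers would normally be spelled out rather than turned into spaces):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (isalpha(c) || c == '\'')
            putchar(tolower(c));      /* keep letters and apostrophes, lowercased */
        else if (c == '\n')
            putchar('\n');            /* keep one sentence per line */
        else
            putchar(' ');             /* punctuation, digits etc. become spaces */
    }
    return 0;
}

Run it as something like "./normalize < book.txt >> voicemail.txt" to append cleaned text to your training file.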

3) Then use the MITLM toolkit to convert the texts you've collected into a language model.

Download MITLM language model toolkit here

http://code.google.com/p/mitlm/

Run it as

estimate-ngram -text voicemail.txt -write-lm your_model.lm

It will create the language model your_model.lm.

4) Sometimes it makes sense to mix your specific model with a generic one. It may help if your training text is small or your model is not good enough. To do that, download a generic model here:

http://keithv.com/software/giga/lm_giga_5k_nvp_3gram.zip

Then unpack it and interpolate with your voicemail model using MITLM tools:

interpolate-ngram -lm "your_model.lm, lm_giga_5k_nvp_3gram.arpa" -interpolation LI -op voicemail.txt -wl Lectures+Textbook.LI.lm

See MITLM tutorial for details http://code.google.com/p/mitlm/wiki/Tutorial

The lm_giga model is quite big; you can also pick the hub4 language model for interpolation. To do that you need to convert it from binary to text form first:

sphinx_lm_convert -ifmt dmp -ofmt arpa hub4.5000.DMP hub4.lm

One day you will be able to do this with the CMU language model toolkit CMUCLMTK, but for now it's more complicated than MITLM, so MITLM is the recommended tool for language model operations.

5) To speed up recognizer startup, sort the model and convert it to binary format:

sphinx_lm_sort < your_model.lm > your_model_sorted.lm
sphinx_lm_convert -i your_model_sorted.lm -o your_model_sorted.lm.dmp

6) In the Pocketsphinx script, use your language model for transcription by adding the following argument:

-lm your_model_sorted.lm.dmp

That's it.

By the way, the Google API is mostly trained on search queries. While that makes it perfectly suitable for voice search, it's not good for voicemail transcription either. Voicemail texts are usually quite sensitive information and it's very hard to get free access to them.

I think after this step the accuracy of the transcription will already be good enough. You will be able to collect transcription results, fix them and use them to improve the language model further.

Acoustic model
Sometimes it's useful to update the acoustic model. This step will require you to compile and set up SphinxTrain. Again, transcribe a few voicemails you've recorded, then organize them into a database. Then follow the acoustic model adaptation HOWTO as described in the CMUSphinx wiki:

http://cmusphinx.sourceforge.net/wiki/tutorialadapt

Acoustic model adaptation always makes sense, but it's quite a time-consuming process. Maybe one day someone will automate it to make it really painless. For example, we have started a project to help train and adapt a model from a set of long audio files accompanied by text, not from a carefully drafted database. Once this project is completed it will be way easier to train and adapt acoustic models. Any help with this is appreciated.

Dictionary
There can be cases when you need to add a few missing words to the dictionary. For example, in step 1 when you adapted the language model you may have got a few words which are missing in cmu07a.dic. Then it makes sense to add them. Just open the dictionary with a text editor, find the appropriate place and add or edit the phonetic pronunciation of the word. For example, the CMU dictionary is missing the word "twitter":

twitter T W IH T ER

Usually this step is not needed, but if you have, for example, accented words or some other unusual words, it may help.

Test the model
After you have adapted the models, re-transcribe the files you have already collected and check whether the accuracy is good or not.

Follow up
So here are the directions to take. I understand it's some work, but maybe you'll consider it worth the effort. We are really trying to make this process easier and your comments on that will be much appreciated.

Chicken-And-Egg in Sphinxbase

Recently Shea Levy pointed me to an issue with verbose output during pocketsphinx initialization. Basically, every time you start pocketsphinx you get something like


INFO: cmd_ln.c(691): Parsing command line:
pocketsphinx_continuous 
Current configuration:
[NAME]      [DEFLT]  [VALUE]
-adcdev
-agc        none     none
-agcthresh  2.0      2.000000e+00
-alpha      0.97     9.700000e-01
-argfile

It's OK for a tool, but not a nice thing for a library, which should be a small horse in the rig of an application. Not every user is happy to see all this stuff dumped on the screen. And the worst thing is that there is no way to turn it off, because "-logfn /dev/null" only affects the output after initialization. So we are looking to make pocketsphinx completely silent.

It appeared to be a more complex issue than I thought. It's a classical chicken-and-egg problem: you use the configuration framework to configure logging, but the configuration framework needs to log itself. We just hardcoded the initialization, but thinking about it afterwards I found a more complex but more rigid approach in the log4j description from http://articles.qos.ch/internalLogging.html

Since log4j never sets up a configuration without explicit input from the user, log4j internal logging may occur before the log4j environment is set up. In particular, internal logging may occur while a configurator is processing a configuration file.

We could have simplified things by ignoring logging events generated during the configuration phase. However, the events generated during the configuration phase contain information useful in debugging the log4j configuration file. Under many circumstances, this information is considered more useful than all the subsequent logging events put together.

In order to capture the logs generated during configuration phase, log4j simply collects logging events in a temporary appender. At the end of the configuration phase, these recorded events are replayed within the context of the new log4j environment, (the one which was just configured). The temporary appender is then closed and detached from the log4j environment.

Oh-woh, I will never get enough passion to implement this properly ;) Let it be as is for now.
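
Still, the record-and-replay idea itself is not that scary. A minimal sketch in C could look like this (log_event_t, logbuf and friends are hypothetical names, not the sphinxbase API):

#include <stdio.h>
#include <string.h>

#define LOGBUF_MAX 256

typedef struct { char msg[256]; } log_event_t;

static log_event_t logbuf[LOGBUF_MAX];
static int logbuf_n = 0;
static FILE *logfp = NULL;               /* logging not configured yet */

/* During configuration, messages go to the temporary buffer. */
static void log_msg(const char *msg)
{
    if (logfp == NULL) {
        if (logbuf_n < LOGBUF_MAX)
            strncpy(logbuf[logbuf_n++].msg, msg, sizeof(logbuf[0].msg) - 1);
    } else {
        fprintf(logfp, "%s\n", msg);
    }
}

/* Once the log file option is parsed, replay the buffered events. */
static void log_set_file(FILE *fp)
{
    int i;

    logfp = fp;
    for (i = 0; i < logbuf_n; i++)
        fprintf(logfp, "%s\n", logbuf[i].msg);
    logbuf_n = 0;
}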

Sphinxbase command line options are still not good: it pretty much lacks proper --help, --version and many more nifty getopt things. One day someone should fix this.

ICASSP 2011 Part 1 - Thoughts

It seems ICASSP this year was a great event; it is a pity I missed it. Just comparing the keynote lists, ICASSP beats Interspeech 4:0. ICASSP is very technical, Interspeech is for linguists. Compare the two:

Making Sense of a Zettabyte World vs Neural Representations of Word Meanings


New session formats like technical tracks and trend discussions are interesting, though I am not sure how they felt in practice.

So this was a reason to spend a few days reading. 1000 papers on speech technology! Huh. Thanks to all the authors for their hard work! Well, I found several duplicates in the end.

The main thing I noted is that the research topics are very sparse, for example:
  • Everyone does speaker recognition. An appealing problem statement here is detecting a synthetic speaker. The paper titled "DETECTION OF SYNTHETIC SPEECH FOR THE PROBLEM OF IMPOSTURE" by De Leon et al. hints that there is no solution for that.
  • I got tired of skipping pursuits, bandits and compressive sensing.
  • On the other side, the increased portion of papers on non-speech signals, the cocktail party problem and signal recovery is very interesting to read.
  • Things like DBN features or the SCARF decoder are widely represented. You can read about applications of CRFs from g2p algorithms to dialogs. But traditional things like search algorithms and adaptation are almost uncovered.
  • It was surprising to find a session dedicated to multimedia security, which must be a gold mine of ideas, in particular if you need a topic for a paper. Is there a company selling such products?
Overall I found several original problem statements as well as inspiring ideas covering very important technology issues. For example, it would be nice to implement a meeting transcription application that combines streams from several iPhones and later transcribes them using multichannel environment compensation. Several meeting transcription setups and channel separation methods are described in the conference proceedings.

After reading some amount of papers I found that conference papers are too short. You see a nice title and an abstract and expect a detailed insight into the problem, with historical discourse and everything explained in detail, a deep investigation. But you get just a description of the technology and a few figures from experiments. On the other hand, I would not be able to read 100 papers of 20 pages each.

Very interestingly, this year's awards are not related to speech technology. That will be the content of Part 2. I just need to go through the last 50 papers left.

Cars Controlled By Speech

Being a speech recognition guy I'm looking for a car with speech recognition included. It sounds strange to select a car just because of that, but I'm just kidding. So far the list is:

  • Honda Accord
  • Any Ford 2011
  • Mazda 6
Not listing something expensive like BMW or Mercedes. Hm, it looks like almost everyone is doing it. Any others? Which one is the most advanced?

Some details on particular implementation

Ford SYNC 2011

Quite an advanced system. Command-based. Supports many types of commands, from controlling the DVD player to getting baseball scores. Supports user profiles but doesn't seem to have a specific training procedure. With current speaker recognition capabilities it could in theory adapt to users automatically, without profiles.

Mazda 6 2011

A pretty interesting system, but limited compared to the previous one. According to the owner's manual it supports a very limited list of commands to manage calls, get incoming messages and so on. Among the interesting capabilities, it supports training and voice entry for contacts. Three languages: English, French, Spanish. Looks like it's using a single microphone, and the voice navigation system seems to have a separate speech recognition subsystem.
    Honda Fit 2009


    Many commands mostly related to navigation but no user adaptation and no profiles. Alphanumeric entry as a backup to vocabulary search. This one is very simple.


    Mitsubishi/Hyundai 2011


I didn't manage to find the manuals for them. The feature name "Bluetooth hands-free phone system with voice recognition and phonebook download" makes me think it's the same system as in the Mazda.


    Talkmatic

Doesn't seem to be deployed yet, but the presentation looks impressive.

    KIA

According to SpeechTechMag, Microsoft and Kia codeveloped the UVO multimedia and infotainment system, which the Korean automaker rolled out in its new Sportage, Sorento, and Optima models late last year. UVO lets users access media content and connect with people through quick voice commands without having to navigate hierarchical menus.

    Decoders And Features

CMUSphinx decoders at a glance, so one can compare. The table is incomplete and imprecise, of course.




Feature                          sphinx2   sphinx3   sphinx4   pocketsphinx
Acoustic Lookahead                  -         -         +          -
Alignment                           +         +         +          -
Flat Forward Search                 +         +         -          +
Finite Grammar Confidence           +         -         -          -
Full n-gram History Tree Search     -         -         +          -
HTK Features                        -         +         +          +
Phonetic Loop Decoder               +         +         -          -
Phonetic Lookahead                  +         +         -          +
PLP Features                        -         -         +          -
PTM Models                          -         -         -          +
Score Quantization                  +         -         -          +
Semi-Continuous Models              +         +         -          +
Single Tree Search                  +         -         -          +
Subvector Quantization              +         +         -          +
Time-Switching Tree Search          -         +         -          -
Tree Search Smear                   -         +         +          -
Word-Switching Tree Search          -         +         -          -
Thread Safety                       -         -         +          +
Keyword Spotting                    -         -         +          -

    And here is the description of the entries

    Specific Applications

Phonetic Loop Decoder. Phonetic loop decoding requires a specialized search algorithm. It's not implemented in Sphinx4, for example.

Alignment. Given the audio and its transcription, get the word timings.

Keyword Spotting. Searching for a keyword requires a separate search space and a different search approach.

Finite Grammar Confidence. Get a confidence estimation for a finite state grammar. This is a complex problem which requires additional operations during search, for example a phone loop pass.

    Effective pruning

Acoustic Lookahead. Using the acoustic score for the current frame we can predict the score for the next frame and thus prune tokens early.

Phonetic Lookahead. Using a phonetic loop decoder we can predict possible phones and thus restrict the large vocabulary search.

    Features

HTK Features. CMUSphinx feature extraction is different from HTK's (different filterbank and transform). To provide HTK compatibility one needs a specific HTK feature extraction.

PLP Features. A type of features different from traditional MFCC; they are more popular now.

    Search Space

Flat Forward Search. A search space where word paths aren't joined in a lextree. Separate paths let us apply the language model probability earlier, so the search is more accurate, but because the search space is bigger it's also slower. Usually flat search is applied as a second pass after the tree search.

Full n-gram History Tree Search. Tokens which have different n-gram histories are tracked separately. For example, the token for "how are UW..." and the token for "hello are UW..." are tracked separately. In pocketsphinx such tokens are just joined and only the best one survives. Full history search is more accurate but slower and more complex to implement.

Word-Switching Tree Search. Separate lextrees are kept for each unigram history. This search is in the middle between keeping the full history and dropping the history altogether.

Single Tree Search. Lextree tokens don't care about word history. This is a faster but less accurate approach.

Time-Switching Tree Search. Lextree states don't care about word history, but several lextrees (3-5) are kept in memory and switched every frame. Because of that there is a higher chance of tracking both histories.

Tree Search Smear. The lextree contains unigram probabilities, so it's possible to prune tokens earlier based on the language score.

    Acoustic Scoring

PTM Models. Models where gaussians are shared across senones with the same central phone, so we don't need to calculate gaussian values for each senone, just a few values per central phone. Then, using different mixture weights, we get the senone scores. This approach reduces the computation required but keeps accuracy at a reasonable level. It's similar to semi-continuous models, where gaussians are shared across all senones, not just across senones with the same central phone.

Score Quantization. Acoustic scores in some cases can be represented by just 2 bytes (semi-continuous models and a specific feature set). Usually scores are in the log domain and shifted by 10 bits. This reduces the memory required for the acoustic model and for scoring, and speeds up the computation, in particular on CPUs without an FPU.
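
As a rough illustration of the idea (not the actual pocketsphinx code), a 32-bit log-domain score can be shifted and clamped into 2 bytes like this:

#include <stdint.h>

/* Drop 10 bits of precision and clamp to the int16 range. */
static inline int16_t quantize_score(int32_t logscore)
{
    int32_t q = logscore >> 10;
    if (q < INT16_MIN) q = INT16_MIN;
    if (q > INT16_MAX) q = INT16_MAX;
    return (int16_t)q;
}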

Semi-Continuous Models. Gaussians are shared across all senones, only the mixture weights are different. Such models are fast and usually quite accurate. Usually they are multistream (s2_4x or 1s_c_d_dd with subvectors 0-12/13-25/26-38) since separate streams can be quantized better.

Subvector Quantization. A gaussian selection approach to reduce acoustic scoring. Basically the continuous model after training is decomposed into several subvector gaussians which are shared across senones and thus scored efficiently.

    When Language Models Fail

    Magic Words of Interspeech 2011

Interspeech 2011 is coming. It is going to be an amazing event, I suppose. If you are interested in what is going on there, let's figure it out.

To keep things simple we will use Unix command line tools. Sometimes text processing can be fun even with simple commands. Text is still the most convenient form of information presentation, way better than HTML or databases. Of course there is a lack of more advanced things like stopword filtering or named entity recognition. Let's hope one day the Unix command line will have them.

1. Download the full printable programs of Interspeech 2010 and Interspeech 2011 with wget, dump them to text with lynx and clean up punctuation with sed.

2. Dump word counts with the SRILM tool ngram-count and cut the 1000 most frequent words from the 2011 list with sort and head. Leave all words in the 2010 list.

    3. Figure out which of the words in 2011 list are new and do not appear in 2010 list with sort and uniq.

Surprisingly, there will be only 2 new words. They are: i-vector and crowdsourcing.

    Dealing with pruning issues

I spent a holiday looking at issues in pocketsphinx decoding in fwdflat mode. Initially I thought it was a bug, but it appeared to be just a pruning issue. The result looked like this:

    INFO: ngram_search.c(1045): bestpath 0.00 wall 0.000 xRT
    INFO: <s> 0 5 1.000 -94208 0 1
    INFO: par..grafo 6 63 1.000 -472064 -467 2
    INFO: terceiro 64 153 1.000 -1245184 -115 3
    INFO: as 154 176 0.934 -307200 -172 3
    INFO: emendas 177 218 1.000 -452608 -292 3
    INFO: ao 219 226 1.000 -208896 -181 3
    INFO: projeto 227 273 1.000 -342016 -152 3
    INFO: de 274 283 1.000 -115712 -75 3
    INFO: lei 284 3059 1.000 -115712 -79 3


Speech recognition is essentially a search for the globally best path in a graph. Beam pruning is used to drop nodes during the search whose score is worse than the best node's score by more than the beam width.
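
In code the pruning decision boils down to a check like this (an illustrative sketch, not the actual pocketsphinx implementation):

#include <stdint.h>

/* Scores are log-probabilities, so `beam` is a large negative value and
   best_score + beam is a threshold somewhat below the best path. */
static int within_beam(int32_t score, int32_t best_score, int32_t beam)
{
    return score >= best_score + beam;   /* paths below the threshold get dropped */
}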
If the beam is too narrow, the result might not be the globally best one even though it is locally the best. In practice this can lead to complex issues like the one described above: the word "lei" spans about 2k frames, which means about 20 seconds. Another sign of overpruning is the number of words scored per frame:




    INFO: ngram_search_fwdflat.c(940): 2931 words recognized (1/fr)
    INFO: ngram_search_fwdflat.c(942): 48013 senones evaluated (16/fr)
    INFO: ngram_search_fwdflat.c(944): 9586 channels searched (3/fr)
    INFO: ngram_search_fwdflat.c(946): 3849 words searched (1/fr)
    INFO: ngram_search_fwdflat.c(948): 9602 word transitions (3/fr)


    If you have just one word per frame it's likely an issue.

More detailed behaviour can be seen if debugging is enabled in the sources:
    #define __CHAN_DUMP__ 1


    You'll see something like

    BEFORE:
    SSID 2866 610 611 (2608)
    SENSCR -604 -215 -371
    SCORES -1014874 -583095 -583097 -583223
    HISTID 170 170 170 170
    AFTER:
    SSID 2866 610 611 (2608)
    SENSCR -604 -215 -371
    SCORES -1015481 -583315 -583317 -583489
    HISTID 170 170 170 170
    BEFORE:
    SSID 2866 610 611 (2608)
    SENSCR -568 -122 -358
    SCORES -1015481 -583315 -583317 -583489
    HISTID 170 170 170 170
    AFTER:
    SSID 2866 610 611 (2608)
    SENSCR -568 -122 -358
    SCORES -1016052 -583442 -583444 -583696
    HISTID 170 170 170 170


    So you see only one HMM per frame is scored and it doesn't generate any other HMMs

Since those issues are hard to notice, starting today we also issue a warning in the decoder log. It will look like this:

    WARNING: "ngram_search.c", line 404: Word 'lei' survived for 2764 frames, potential overpruning
    WARNING: "ngram_search.c", line 404: Word 'lei' survived for 2765 frames, potential overpruning


So you'll be warned if something goes wrong.

It's very easy to forget about pruning issues because they are not really visible. You'll only get a drop in accuracy and you might not notice it, or you might think it's a model accuracy problem, not a search accuracy problem. In practice you always need to remember the following:

Search space configuration and settings have a certain effect on the final accuracy and speed.

    Default settings are often wrong for modified models. If you have a new model you need to review all the configuration parameters in order to make sure they work. If there are many parameters, you need to check all of them.

If pruning errors in your decoder have a very small effect, it means you haven't optimized your search space properly. You can definitely do better.



    At least we might want to report more useful metrics about pruning in the future.

    Google knows better


    Well, it might be my personal search trained that way

    ICASSP 2012


I recently attended the ICASSP 2012 conference in Kyoto, Japan. As expected, it was an amazing experience. Many thanks to the organizers, the technical program committee and the reviewers for their hard work.

The conference gathered more than a thousand experts in signal processing and speech recognition. The total number of submitted papers was more than 2000, and more than 1300 of them were presented. It's an enormous amount of information to process, and it was really helpful to be there and see everything for yourself. Most importantly, it's an opportunity to meet the people you work with remotely and talk about speech recognition in person. We talked quite a lot about the Google Summer Of Code project we will run soon; you can expect very interesting features implemented there. It's so helpful to map virtual characters to real people.

And Kyoto, the ancient capital of Japan, was just beautiful. It's an amazing place to visit.

Given the amount of papers and data, I think it's critically important to summarize the material or at least provide some overview of the results presented. I hope that future organizers will fill that gap. For now, here is a not very long list of papers and topics I found interesting this year.

    Plenary talks

First of all, I very much liked the two plenary sessions I attended. The talk by Dr. Chin-Hui Lee was about better acoustic modeling tools. Though the neural networks don't seem to provide good accuracy, the main idea was that without a good acoustic model you cannot get good accuracy. The only problem with all approaches like this, unfortunately, is that they are evaluated on the carefully prepared TIMIT database, thus on perfectly clean speech. Everything gets completely different when you move to the area of spontaneous noisy speech we usually deal with in practical tasks.

The second talk, by Dr. Stephane Mallat, was about mathematical ideas in machine learning and recognition tasks. Though not directly related to speech, it covered wavelets and mathematical invariants. If properly developed, such a theory could build a very good foundation for accurate and, most importantly, provably optimal speech recognition.

    Discriminative language models

One new thing for me was several papers on discriminatively trained language models. It seems that most of this work is done within a neural network framework, but I think it could be generalized to training an arbitrary G-level WFST.

    DISTRIBUTED DISCRIMINATIVE LANGUAGE MODELS FOR GOOGLE VOICE-SEARCH
    Preethi Jyothi, The Ohio State University, United States; Leif Johnson, The University of Texas at Austin, United States; Ciprian Chelba, Brian Strope, Google, Inc., United States

    Big Data

This last paper also belongs to the recently emerging big data trend, which was quite well represented at the conference. It's an analog of nuclear physics in our world, requiring big investment and huge teams. It seems to be in a very initial state still, but it must be a very hot topic in the coming years, mostly led by the Google team. Well, you can't expect anything else from Google. Another paper from them is also about things hard to imagine.

    DISTRIBUTED ACOUSTIC MODELING WITH BACK-OFF N-GRAMS Ciprian Chelba, Peng Xu, Fernando Pereira, Google, Inc., United States; Thomas Richardson, University of Washington, United States

So far Google trains on 87 thousand hours. Imagine that. It didn't help them much yet: they reduced the word error rate from 11% to something like 9.5%.

A big data paper from CMU is interesting too, describing a speedup method for training on big amounts of speech data:

    TOWARDS SINGLE PASS DISCRIMINATIVE TRAINING FOR SPEECH RECOGNITION
    Roger Hsiao, Tanja Schultz

Importantly, the big data idea turns into the idea that acoustic and language models are equally important and should be trained together. The paper about that is:

    OPTIMIZATION IN SPEECH-CENTRIC INFORMATION PROCESSING: CRITERIA AND TECHNIQUES
    Xiaodong He, Li Deng, Microsoft Research, United States

We could even go further and state that noise parameters and accurate
transcriptions are also part of the training model, and thus need to be
trained jointly. Some papers on that subject:

    SEMI-SUPERVISED LEARNING HELPS IN SOUND EVENT CLASSIFICATION
    Zixing Zhang, Björn Schuller, Technische Universität München, Germany

    N-BEST ENTROPY BASED DATA SELECTION FOR ACOUSTIC MODELING
    Nobuyasu Itoh, IBM Research - Tokyo, Japan; Tara N. Sainath, IBM T.J. Watson Research Center, United States; Dan Ning Jiang, Jie Zhou, IBM Research - China, China; Bhuvana Ramabhadran, IBM T.J. Watson Research Center, United States

    Efficient decoders

If you are interested in efficient decoders, the session on LVCSR was very interesting. I'd note the following papers:

    EXTENDED SEARCH SPACE PRUNING IN LVCSR
    David Nolden, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany


    USING A* FOR THE PARALLELIZATION OF SPEECH RECOGNITION SYSTEMS
    Patrick Cardinal, Gilles Boulianne, CRIM, Canada; Pierre Dumouchel, ETS, Canada

    The idea is to use fast WFST pass for heuristic score estimation for A*.

    JOINING ADVANTAGES OF WORD-CONDITIONED AND TOKEN-PASSING DECODING
    David Nolden, David Rybach, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany

    DBN

Quite many DBN papers, but I'm not very interested in them. Microsoft trains DBNs on the RT03 task and gets pretty good results: 19% WER compared to a baseline of 25-27%:

    EXPLOITING SPARSENESS IN DEEP NEURAL NETWORKS FOR LARGE VOCABULARY SPEECH RECOGNITION
    Dong Yu, Microsoft Research, United States; Frank Seide, Gang Li, Microsoft Research Asia, China; Li Deng, Microsoft Research, United States

Recurrent neural networks are also good:

    REVISITING RECURRENT NEURAL NETWORKS FOR ROBUST ASR
    Oriol Vinyals, Suman Ravuri, University of California at Berkeley, United States; Daniel Povey, Microsoft Research, United States

Weighted Finite State Transducers

The whole WFST session was great. In particular I very much liked the papers on fillers in WFSTs as well as the last AT&T paper on uniform discriminative training with WFSTs, which gives some insight into the internals of the AT&T recognizer.

    SILENCE IS GOLDEN: MODELING NON-SPEECH EVENTS IN WFST-BASED DYNAMIC NETWORK DECODERS
    David Rybach, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany

    A GENERAL DISCRIMINATIVE TRAINING ALGORITHM FOR SPEECH RECOGNITION USING WEIGHTED FINITE-STATE TRANSDUCERS
    Yong Zhao, Georgia Institute of Technology, United States; Andrej Ljolje, Diamantino Caseiro, AT&T Labs-Research, United States; Biing-Hwang (Fred) Juang, Georgia Institute of Technology, United States

    Robust ASR

The robust session was striking. PNCC features seem to perform better than everything else; during their talks all the other authors were plainly saying their method is good, but PNCC is better. Congratulations to Rich Stern, Chanwoo Kim and the others involved.

    POWER-NORMALIZED CEPSTRAL COEFFICIENTS (PNCC) FOR ROBUST SPEECH RECOGNITION Chanwoo Kim, Microsoft Corporation, United States; Richard Stern, Carnegie Mellon University, United States

Corporations at ICASSP

It's great to see more involvement of the industry in the research process. I think it's great that major industry players contribute their knowledge to the open shared pool. Honestly, academic activities need more influence from the industry too.

Check out one paper from a recently emerged speech corporation, Apple Inc.

    LATENT PERCEPTUAL MAPPING WITH DATA-DRIVEN VARIABLE-LENGTH ACOUSTIC UNITS FOR TEMPLATE-BASED SPEECH RECOGNITION Shiva Sundaram, Deutsche Telekom Laboratories, Germany; Jerome Bellegarda, Apple Inc., United States

    And another one from a great small company EnglishCentral.com

    DISCRIMINATIVE TRAINING FOR SPEECH RECOGNITION IS COMPENSATING FOR STATISTICAL DEPENDENCE IN THE HMM FRAMEWORK Dan Gillick,

    Steven Wegmann, International Computer Science Institute, United States; Larry Gillick, EnglishCentral, Inc., United States

    CRF for Confidence Estimation

Half of the confidence papers were dealing with CRFs. It's actually a nice idea to exploit the fact that a low confidence region usually spans multiple words.

    CRF-BASED CONFIDENCE MEASURES OF RECOGNIZED CANDIDATES FOR LATTICE-BASED AUDIO INDEXING
    Zhijian Ou, Huaqing Luo, Tsinghua University, China

    Generic models for ASR

This is a paper I liked about the long-dreamed-of inclusion of syllables into the speech recognition model:

    SYLLABLE: A SELF-CONTAINED UNIT TO MODEL PRONUNCIATION VARIATION
    Raymond W. M. Ng, Keikichi Hirose, The University of Tokyo, Japan

    I hope this research will go into mainstream one day.

    Conclusion

I'm sure I skipped a lot of interesting results in this list. That's only my choice, not a reflection of paper quality; I think you would choose others you prefer more. In any case, I enjoyed going deep into the wide collection of recent research results in speech recognition that the conference gave me an opportunity to see. Thanks to the authors for their hard work!

    Blizzard Challenge 2012

    This year it's a little bit later, but it's amazing that Blizzard Challenge 2012 evaluation is now online.

This year is going to be very interesting. The data to create the voices is taken from audiobooks, and one part of the test includes synthesis of paragraphs. That means you can actually estimate how a TTS built from public data can perform.

    The links to register are:

    For speech experts:
    http://groups.inf.ed.ac.uk/blizzard/blizzard2012/english/registerexperts.html

    For other volunteers:
    http://groups.inf.ed.ac.uk/blizzard/blizzard2012/english/registerweb.html

    The challenge was created in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

    Please distribute the second URL as widely as you can - to your colleagues, students, friends, other appropriate mailing lists, social networks, and so on.

    How To Choose Embedded Speech Recognizer

There are quite a few open source solutions around for building a speech recognition system for a low-resource device, and it's quite hard to choose. For example, you need a speech recognition system for a platform like the Raspberry Pi and you are choosing between HTK, CMUSphinx, Julius and many other implementations.

In order to make an informed decision you need to consider a set of features specifically required to run speech recognition in a low-resource environment. Without them your system will probably be accurate, but it will also consume too many resources to be useful. Some of them are listed below.


Features for a small memory footprint:
  •  Support for semi-continuous models
  •  Quantized and pruned data structures: mixture weights quantized to 4 bits and pruned, acoustic scores quantized to 16 bits
  •  Fixed-point arithmetic (see the sketch after this list)
  •  Bitvector structures
Features for fast computation:
  •  Top gaussian selection
  •  Simplified lextree search without cross-word context
  •  Multipass processing with tunable performance at each step
  •  Cache access optimization for increased memory throughput
  •  Downsampling
  •  Phone lookahead
Support for popular mobile platforms:
  • Out-of-box support for Android
  • Out-of-box support for iPhone
  • Out-of-box support for embedded Linux systems like the Beagleboard
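
As an illustration of the fixed-point item above, here is a tiny Q15 sketch of the kind of arithmetic used instead of floating point on CPUs without an FPU (purely illustrative, not the actual pocketsphinx code):

#include <stdint.h>

typedef int32_t q15_t;                       /* value scaled by 2^15 */

#define FLOAT2Q15(x) ((q15_t)((x) * 32768.0f))

static inline q15_t q15_mul(q15_t a, q15_t b)
{
    return (q15_t)(((int64_t)a * b) >> 15);  /* rescale after the multiply */
}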
And quite a few other features which are helpful for speech recognition. Apart from commercial engines, the only engine which implements the features above is Pocketsphinx:

    http://cmusphinx.sourceforge.net

    You can learn more about pocketsphinx features from the publication:

    http://www.cs.cmu.edu/~dhuggins/Publications/pocketsphinx.pdf

    You can learn how to optimize Pocketsphinx for a low-resource environment from the wiki page:

    http://cmusphinx.sourceforge.net/wiki/pocketsphinxhandhelds

Training acoustic models for embedded devices also has some specifics required by Pocketsphinx, so SphinxTrain is the optimal solution here.

There are also demos for Android and iPhone.

    Recent state of UniMRCP

A cool project in the CMUSphinx ecosystem is UniMRCP. It implements the MRCP protocol and backends which allow using speech functions from the most common telephony frameworks like Asterisk or FreeSWITCH.

Besides the Nuance, Lumenvox and SpeechPro backends, it supports Pocketsphinx, which is amazing and free. The bad side is that it's not going to work out of the box. The decoder integration is just not done right:

• VAD does not properly strip silence before it passes audio frames to the decoder; because of that, recognition accuracy is essentially zero.
• The decoder configuration is not optimal.
• The result is not retrieved the way it could be.

Also, UniMRCP is not going to work with recent Asterisk releases like 11; it works with 1.6 as far as I can see. The new API is not supported.

So, a lot of work is needed to make it actually work and not confuse users who build their models with CMUSphinx. However, the prospects are amazing, so one might want to spend some time finalizing the Pocketsphinx plugin in UniMRCP. I hope to see it soon.
