Speech Recognition With Vosk

Building a Generic Language Model

I spent some time recently building a language model from the open Gutenberg texts; it has been released today:

http://cmusphinx.sourceforge.net/2013/01/a-new-english-language-model-release/

Unfortunately, it turned out that it is very hard to build a model which is relatively "generic". Language models are very domain-dependent; it is almost impossible to build a good language model for every possible text. Books are almost useless for conversational transcription, no matter how much book text you have. And you need terabytes of data to gain just 1% in accuracy.

Still, the released language model is an attempt to do so. More importantly, the source texts used to build the language model are more or less widely available, so it will be possible to extend and improve the model in the future.

I found it quite interesting to think about this problem of domain dependence. Despite the common belief that trigram models work "relatively well", in fact they do not. I find this survey very relevant:


Two Decades of Statistical Language Modeling: Where Do We Go from Here? by Ronald Rosenfeld


Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model
trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period...
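To make this brittleness concrete, here is a minimal sketch of how one could measure it on one's own data, assuming a trained ARPA model and the kenlm Python bindings; the model and test-set file names are placeholders, and the perplexity is computed over words plus end-of-sentence tokens, roughly like SRILM's ppl.

import kenlm

# Hypothetical files: a model trained on Gutenberg books and two test sets,
# one from book text and one from conversational transcripts.
model = kenlm.Model("gutenberg.arpa")

def corpus_perplexity(lm, path):
    log10_prob, tokens = 0.0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if not sentence:
                continue
            log10_prob += lm.score(sentence, bos=True, eos=True)
            tokens += len(sentence.split()) + 1   # words plus </s>
    return 10.0 ** (-log10_prob / tokens)

print("books ppl:         ", corpus_perplexity(model, "books_test.txt"))
print("conversational ppl:", corpus_perplexity(model, "switchboard_test.txt"))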
Recent advances in language models for speech recognition include discriminative language models, which are impractical to build unless you have unlimited processing power.


And recurrent neural-network language models as implemented by the RNNLM toolkit:

Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, Japan.

RNNLMs are becoming very popular and they can provide significant gains in perplexity and decoding WER, but I don't believe they can solve the domain-dependency issue and enable us to build a truly generic model.

So, research on the subject is still needed. Maybe, once the domain-dependency variance can be properly captured, a far more accurate language model can be built.




Around noise-robust PNCC features

Last week I worked on PNCC features, the well-known noise-robust features for speech recognition by Chanwoo Kim and Richard Stern. I ran quite a few experiments with the parameters and did some research around PNCC. Here are some thoughts on that.

The fundamental paper about PNCC is C. Kim and R. M. Stern, "Power-Normalized Cepstral Coefficients for Robust Speech Recognition", IEEE Trans. Audio, Speech, and Language Processing, but for a detailed explanation of the process and the experiments one can look at C. Kim, "Signal Processing for Robust Speech Recognition Motivated by Auditory Processing", Ph.D. thesis. The Octave code is available too. The C implementation is also available in the bug tracking system, thanks to Vyacheslav Klimkov, and will be committed soon after some cleanup. I hope a Sphinx4 implementation will follow.

However, quite a lot of important information is not contained in the papers. The main PNCC pipeline is similar to that of conventional MFCC except for a few modifications. First, a gammatone filterbank is used instead of the triangular filterbank. Second, the filterbank energies are filtered to remove noise and reverberation effects. And third, a power-law nonlinearity together with power normalization is applied. Most of the pipeline design is inspired by research on the human auditory system.
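To show the overall shape of such a pipeline, here is a minimal numpy/scipy sketch, not the reference PNCC implementation: it uses a triangular mel filterbank in place of the gammatone filterbank and omits the medium-duration noise suppression and temporal masking stages, and all parameters are illustrative only.

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_fft, sr, fmin=100.0, fmax=None):
    # Triangular filters on a mel scale; real PNCC uses gammatone filters here.
    fmax = fmax or sr / 2.0
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def pncc_like(signal, sr=16000, n_fft=512, hop=160, n_filters=40, n_ceps=13):
    window = np.hamming(n_fft)
    frames = np.array([signal[i:i + n_fft] * window
                       for i in range(0, len(signal) - n_fft, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # short-time power spectrum
    energies = power @ triangular_filterbank(n_filters, n_fft, sr).T
    # Power-law nonlinearity (exponent 1/15) instead of the log used in MFCC.
    compressed = np.power(energies + 1e-10, 1.0 / 15.0)
    # Decorrelate with a DCT and keep the first cepstral coefficients.
    return dct(compressed, type=2, norm="ortho", axis=1)[:, :n_ceps]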

There is a lot of research on using auditory ideas, including power-law nonlinearities, gammatone filterbanks and so on, in speech recognition, and the PNCC papers do not cover it fully. Important references are the fundamental paper about RASTA and some recent research on auditory-inspired features like Gammatone Features and Feature Combination by Schlüter et al.

The PNCC design based on auditory features raises quite fundamental questions that are not discussed in the papers above. One very important paper here is Spectral Signal Processing for ASR (1999) by Melvyn Hunt from Dragon. The idea of the paper is:
The philosophical case for taking what we know about the human auditory system as an inspiration for the representation used in our automatic recognition systems was set out in the Introduction, and it seems quite strong. Unfortunately, there does not seem to be much solid empirical evidence to support this case. Sophisticated auditory models have not generally been found to be better than conventional representations outside the laboratories in which they were developed, and none has found its way into a major mainstream system. Certainly, there are successful approaches and features that are generally felt to have an auditory motivation—the use of the mel-scale, the cube-root representation, and PLP. However, this paper has sought to show that they have no need of the auditory motivation, and their properties can be better understood purely signal processing terms, or in some cases in terms of the acoustic properties of the production process. Other successful approaches, such as LDA made no pretense of having an auditory basis.
This idea is very important because the PNCC paper is a very experimental one and doesn't really cover the theory behind the design of the filterbank. There are good things in the PNCC design and some not-so-clear things too. Here are some observations I had:

1. PNCC is a really simple and elegant feature extraction scheme; all the steps can be clearly understood, and that makes PNCC very attractive. The noise robustness properties are really great too.

2. Noise filtering does reduce accuracy in clean conditions; usually this reduction is visible (about 5% relative) but it can be justified since we get quite a good improvement in noise. Although there is a claim that PNCC is better than MFCC on clean data, my experiments do not confirm that. The PNCC papers never provide exact numbers, only graphs, which makes it very hard to verify their findings.

3. Band bias subtraction and temporal masking are indeed very reasonable stages to apply in a feature extraction pipeline. Given that the noise is mostly additive with a slowly changing spectrum, it is easy to remove it using long-term integration and an analog of Wiener filtering.

4. The gammatone filterbank doesn't improve significantly over the triangular filterbank, so essentially its complexity is not justified. Moreover, the default PNCC filterbank is suboptimal compared to a well-tuned MFCC filterbank. The filterbank starts only at 200 Hz, so for most broadcast recordings it has to be changed to 100 Hz.

5. The power-law nonlinearity is mathematically questionable since, unlike the log, it does not turn a channel modification into a simple additive offset that can later be removed with CMN. The tests were done on a normalized database like WSJ, while every real database will show a reduction in performance due to the complex power-law effects. The overall power normalization with a moving average makes things even worse and reduces the ability to normalize scaled audio at training and decoding time; for example, for very short utterances it is really hard to estimate the power properly. The power nonlinearity could be compensated with variance normalization, but there is no sign of that in the PNCC papers. So my personal choice is a shifted log nonlinearity, which is log for high energies and has a shift at the low end to deal with noise. Log is probably a bit less accurate in noise, but it is stable and has good scaling properties (see the small numeric check after this list).

6. For raw MFCC, a lifter has to be applied to the coefficients for best performance, or LDA/MLLT has to be applied to make the features more Gaussian-like. Unfortunately, the PNCC paper doesn't say anything about liftering or LDA/MLLT. With LDA the results could be quite different from the ones reported.
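Here is the small numeric check of the scaling argument from point 5, assuming per-utterance cepstral mean normalization over synthetic filterbank energies: under the log, a constant gain becomes an additive offset that CMN removes exactly, while under the 1/15 power law it stays as a multiplicative factor, which is why variance normalization would be needed on top.

import numpy as np

rng = np.random.default_rng(0)
energies = rng.uniform(0.1, 10.0, size=(200, 40))   # synthetic filterbank energies
gain = 3.0                                           # constant channel / recording gain

def cmn(x):
    return x - x.mean(axis=0)                        # per-utterance mean normalization

log_clean = cmn(np.log(energies))
log_gained = cmn(np.log(gain * energies))
pow_clean = cmn(energies ** (1.0 / 15.0))
pow_gained = cmn((gain * energies) ** (1.0 / 15.0))

print(np.max(np.abs(log_clean - log_gained)))   # ~1e-15: the gain is gone
print(np.max(np.abs(pow_clean - pow_gained)))   # clearly nonzero: the gain remains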

Still, PNCC seems to provide quite good robustness in noise, and I think it will improve the performance of the default models. The current plan is to import PNCC into pocketsphinx and sphinx4 as the default features and train models for them.

Mixer 6 database release by LDC & Librivox

LDC has recently announced the availability of a very large speech database for acoustic model training. The database, named Mixer 6, contains an incredible 15000 hours of transcribed speech data from a few hundred speakers. While commercial companies have access to significantly bigger sets, Mixer 6 is the biggest dataset ever used in research; the previously available Fisher database has only around 2000 hours.

It would be really interesting to see the results obtained with this database; the data size should improve existing system performance. However, I think this dataset will pose a critical challenge to the research and development community. Essentially, such a data size means it will be very hard to train a model using conventional software and accessible hardware. For example, it takes about a week and a decent cluster to train a model on 1000 hours; with 15000 hours you would have to wait several months unless more efficient algorithms are introduced. So, it is not easy.

On the other hand, we have access to a similar amount of data: the Librivox archive contains even more high-quality recordings with the text available. Training a model on Librivox data certainly ought to be a focus of development. Such training is not going to be straightforward either; new algorithms and software must be created. A critical issue is to design an algorithm which will improve the accuracy of the model without the need to process the whole dataset. Let me know if you are interested in this project.

By the way, Librivox accepts donations, and they are definitely worth it.


System Combination WER

There is one thing I usually wonder about while reading yet another conference paper on speech recognition. The usual paper limit is 4 pages, and the authors usually want to write exactly 4 pages. What should you do if you don't have enough material? Right, you can build exactly the same systems with PLP features and MFCC features, and probably with some other features too, then add one more table about system combination WER and probably one more graph, or you can mix two types of LM and report another nice graph.

This practice started long, long ago, I think during the NIST evaluations, when participants reported system combination WER. NIST even invented the ROVER algorithm for better combination.

For me personally, such content reduces the quality of a paper significantly. The system combination WER was never a meaningful addition. Yes, it is well known that if you combine MFCC with PLP you can reduce WER by 0.1% and probably win the competition. From a scientific point of view this result adds zero new information; it is just filler for the rest of the paper. Also, to get a combination result for 5 systems you usually spend 5 times more compute than for the individual results. Not worth it for a 0.1% improvement; you can usually get the same with slightly wider beams.

So instead consider doing something else: try to cover the algorithms you used and explain why they work, try to describe the problems you have solved, try to add new questions you consider interesting. At least try to collect more references and write a good overview of previous research. That will save your time, the reader's time and the computing power you used to build another model.

A very simple but very important thing for properly modeling the language

If I were a scientific advisor, I would give my student the following problem:

Take a text, take an LM, compute the perplexity:

file test.txt: 107247 sentences, 1.7608e+06 words, 21302 OOVs 0 zeroprobs, logprob= -4.06198e+06 ppl= 158.32 ppl1= 216.345

Join every two lines in text:
awk 'NR%2{printf "%s ",$0;next}{print;}' test.txt > testjoin.txt

Test again:
file testjoin.txt: 53624 sentences, 1.7608e+06 words, 21302 OOVs 0 zeroprobs, logprob= -4.05859e+06 ppl= 183.409 ppl1= 215.376

This is a really serious issue for decoding conversational speech: the perplexity rises from 158 to 183, and in real-life cases it gets even worse; accuracy drops accordingly. Utterances very often contain several sentences, and it is really crazy that our models cannot handle that properly.

Should we listen to our models

I recently came across an interesting paper worth consideration:

Rethinking Algorithm Design and Development in Speech Processing
by Thilo Stadelmann et al

This is not mainstream research, but that is exactly what makes it interesting. The main idea of the paper is that to understand and develop speech algorithms we need to advance our tools to assist our intuition. This idea is quite fundamental and definitely has interesting extensions.

Modern tools are limited: most developers only check spectrograms and never visualize distributions, lattices or context dependency trees. N-grams are also rarely visualized. For speech, the paper suggests building tools not just to view our models but also to listen to them. I think this is quite a productive idea.

In modern machine learning, visualization definitely helps to extend our understanding of complex structures; here the terrific Colah's blog comes to mind. It would be interesting to extend this beyond pictures.

On SANE 2015 Videos on Signal Separation

Recently a great collection of videos from the Speech and Audio in the Northeast (SANE) 2015 workshop was shared. The main topic of the workshop was sound source separation, which I consider a very important research direction for the near future, something that will be critical to solve in order to get human-like performance from speech recognition systems.

We did some experiments before with NMF and other methods to robustly recognize overlapped speech, but my conclusion is that unless training and test conditions are carefully matched, the whole system does not really work; anything unknown in the background destroys the recognition result. For that reason I was very interested to check the recent progress in the field. The research is at a pretty early stage, but there are certainly very interesting results.
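For readers unfamiliar with the NMF approach, here is roughly what such an experiment looks like in code. This is only a minimal sketch with a random stand-in spectrogram and made-up component assignments; in practice the spectral templates have to be learned per source, which is exactly the matching condition that breaks down in the wild.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
spectrogram = np.abs(rng.normal(size=(257, 400)))       # stand-in magnitude STFT

nmf = NMF(n_components=20, init="nndsvd", max_iter=400)
templates = nmf.fit_transform(spectrogram)              # (freq, components)
activations = nmf.components_                           # (components, time)

# Pretend the first 10 components belong to the target speaker and
# reconstruct that part of the mixture with a Wiener-like mask.
target = templates[:, :10] @ activations[:10, :]
full = templates @ activations + 1e-10
mask = target / full
separated = mask * spectrogram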

The talk by Dr. Paris Smaragdis is quite useful for understanding the connection between non-negative matrix factorization and the more recent neural network approach; it also demonstrates how a neural network works by selecting principal components from the data.


One interesting bit from the talk above is the announcement of bitwise neural networks, which are a very fast and effective way to classify inputs. I believe this could be another big advance in the performance of speech recognition algorithms. The details can be found in the following publication: Bitwise Neural Networks by Minje Kim and Paris Smaragdis. Overall, the idea of bit-compressed computation to reduce memory bandwidth seems very important (the LOUDS language model in the Google mobile recognizer is also from this area). I think NVIDIA should be really concerned about it, since the GPU is certainly not the device this type of algorithm needs. No more need for expensive Teslas.

Another interesting talk was by Dr. Tuomas Virtanen, who presents a very interesting database and an approach that uses neural networks to separate different event types. The results are pretty entertaining.

This video also had quite important bits, one of them the announcement of the Detection and Classification of Acoustic Scenes and Events Challenge 2016 (DCASE 2016), in which acoustic scene classification will be evaluated. The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "street" or "office". The discussion of the challenge, which starts soon, is already going on in the challenge group; it would be very interesting to participate.

Harmonic Noise Model in Speech Recognition


Recently I came across a nice demo about generating natural sounds from physical models. This is a really exciting topic because, while Hollywood can now draw almost anything, Star Wars included, sound generation remains a pretty limited and unexplored area. For example, really high-quality speech still cannot be created by computers, no matter how powerful they are. This leads to the question of speech signal representation.

Accurate speech signal representations have made a big difference in several areas of speech processing like TTS, voice conversion and voice coding. The core idea is very simple and straightforward but also powerful: acoustic signals are produced either by harmonic oscillation, in which case the signal has structure, or by turbulence, in which case we see something like white noise. In speech these classes are represented by vowels and sibilant consonants; everything else is a mixture of the two, with some degree of turbulence and some degree of structure. This is not really speech-specific, though; all other real-world signals except artificial ones can be analyzed from this point of view.

Such a representation allowed greatly improved voice compression in the class of MELP (mixed excitation linear prediction) codecs. Basically, we represent the speech as noise plus harmonics and compress them separately. That allowed speech to be compressed down to an unbelievable 600 bit/s. Mixed excitation was also very important in text-to-speech synthesis, where it made a big difference, as was proven quite some time ago in Mixed excitation for HMM-based speech synthesis by Takayoshi Yoshimura et al., 2001.
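There is no harmonic-plus-noise front end ready to use in the common speech toolkits, but as a rough way to get a feel for this kind of decomposition one can split a spectrogram into a quasi-harmonic part and a residual with librosa's median-filtering HPSS. To be clear, this is not an HNM or MELP-style analysis (no pitch tracking, no sinusoidal model), just an illustration, and the input file name is a placeholder.

import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=16000)        # placeholder input file
stft = librosa.stft(y)
harmonic, residual = librosa.decompose.hpss(stft)   # median-filtering separation

sf.write("harmonic.wav", librosa.istft(harmonic), sr)
sf.write("residual.wav", librosa.istft(residual), sr)
print("harmonic energy ratio:",
      np.sum(np.abs(harmonic) ** 2) / np.sum(np.abs(stft) ** 2))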

Unfortunately there is very little published research on mixed excitation models for speech recognition. I only found the paper A harmonic-model-based front end for robust speech recognition by Michael L. Seltzer, which does consider a harmonic plus noise model but focuses on robust speech recognition rather than on the advantages of the model itself. However, I believe such a model can be quite important for speech analysis because it allows speech events to be classified with a very high degree of certainty. For example, if you consider the task of creating a TTS system from a voice recording, you will notice that even the best algorithms still confuse sounds a lot, assign incorrect boundaries and select wrong annotations. A more accurate signal representation could help here.

It would be great if readers could share more links on this, thank you!

IWSLT 2015

The IWSLT 2015 proceedings recently appeared. This is an important ASR competition focused on the translation (and, more interesting for us, transcription) of TED talks.

The best system, from MITLL-AFRL, had a nice WER of 6.6%.

It is interesting that most of the winning systems (the same was true of the Cambridge system in the MGB challenge) used combinations of customized HTK + Torch and Kaldi. Kaldi alone does not get the best performance (11.4%); plain custom HTK is usually better, with a WER of 10.0% (see Table 8). And the combination usually gives a ground-breaking result.

There is something interesting here.

The case against probabilistic models in metric spaces

A recent discussion about OOV words on the kaldi group reminded me of this old problem.

One of the things that make modern recognizers so unnatural is the probabilistic models behind them. It is a core design decision to build the recognizer in terms of class probabilities and to use models which are all probabilistic. Probabilistic models are easy to estimate, but they often do not fit reality.

In the most common situation, if you have two classes A and B and a garbage class G, a point from the garbage is estimated as either A or B, and it is very hard to properly classify it as G. While the probability of the signal is easy to estimate from a database of examples, the probability of the garbage is very hard to estimate: you need a huge database of garbage examples or you will not be able to get the garbage estimate right. As a result, current systems cannot drop non-speech sounds and often create very misleading hypotheses. Bad things also happen in training: incorrectly labelled examples significantly disturb correct probability estimation, and the model has no means to detect them.

And in the long term the chase for probabilistic models gets worse; everything is reduced to a probabilistic framework. People talk about graphical models, Gaussian processes, stick-breaking models and Monte Carlo sampling when they simply need to optimize the number of Gaussians in a mixture with a simple cost function. And they never tell you that you could simply train a 500-Gaussian mixture and it would work equally well.

You see the same issue in search engines: you cannot use "not" in a search; for example, you cannot search for a "restaurant not on the river bank". Though some companies try to implement such search, this effort is not widespread yet.

The situation changes somewhat if we consider some real space of variants, for example a metric space. A much more reasonable decision can be made with geometrical models: you just look at the distance between the observation and the expectation and make a decision based on a certain threshold. Of course you need to train the threshold and the distance function, but this decision relies only on the observation and the distance, not on the probability of everything else. Yes, I am talking about plain old SVMs. A toy illustration is given below.
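Here is the toy illustration, using a nearest-centroid rule with a distance threshold as a stand-in for the metric approach (an SVM with a rejection threshold would play the same role); the data, the threshold and the garbage point are all made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
A = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
B = rng.normal(loc=[4.0, 0.0], scale=0.5, size=(200, 2))
garbage = np.array([[8.0, 8.0]])                 # looks like neither class

X = np.vstack([A, B])
y = np.array([0] * 200 + [1] * 200)

# The probabilistic two-class model is forced to call the garbage point A or B.
clf = LogisticRegression().fit(X, y)
print("posterior on garbage:", clf.predict_proba(garbage))   # near-certain "B"

# A distance to the class centroids plus a threshold can reject it as garbage.
centroids = np.stack([A.mean(axis=0), B.mean(axis=0)])
dists = np.linalg.norm(centroids - garbage, axis=1)
threshold = 2.0                                  # would be tuned on held-out data
label = "G" if dists.min() > threshold else "AB"[int(dists.argmin())]
print("distance-based decision:", label, "distances:", dists)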

The metric is really the key here; with a generic space you indeed cannot invent anything more advanced than the simple Bayesian rule. However, in the presence of a metric you might hope to get much more interesting results from using it, or at least from combining the metric decision with the probabilistic decision.

Unfortunately there is not much information about this on the net; almost all AI books start with probabilistic reasoning as the natural approach to intelligence. I found some research like this paper, but it is far from complete. Any links to more complete research on the topic would be really appreciated.

Future of texts

It seems that people will soon lose the ability to read, comprehend and remember long texts. The question now is: is it possible to deliver very complex messages without texts?
The critical issue is to design a flow of information into the human brain which allows one both to scan through extremely large amounts of data and to deduce new meanings. Text/speech is indeed quite a slow channel for that; vision might be reasonable.

Visualization seems relevant if we want to keep human intelligence instead of replacing it with pure computer intelligence. Works like LargeVis:

Visualizing Large-scale and High-dimensional Data
by Jian Tang, Jingzhou Liu, Ming Zhang, Qiaozhu Mei

are much more important in that case. See also the LargeVis project on GitHub.

Learning with huge memory

Recently a set of papers was published about "memorization" in neural networks. For example:

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

also

Understanding deep learning requires rethinking generalization

It seems the large-memory approach has a point: you don't need millions of computing cores in a CPU, which would be too power-hungry; you could just go with very large memory and a reasonable number of cores that access the memory with hashing (think of Shazam, randlm, or G2P by analogy; see the sketch below). You probably do not need heavy tying either.

The advantages: you can quickly incorporate new knowledge by just putting new values into memory, you can model corner cases since they all remain accessible, and, again, you are much more energy-efficient.
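A minimal sketch of the memory-plus-hashing idea, using random-hyperplane locality-sensitive hashing as the lookup; everything here, sizes included, is made up for illustration. The point is that a query touches only one bucket instead of scanning the whole memory.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 64, 16
planes = rng.normal(size=(n_bits, dim))          # random hyperplanes define the hash

def lsh_key(vec):
    return tuple((planes @ vec > 0).astype(int)) # sign pattern = bucket id

memory = defaultdict(list)                       # bucket id -> stored (vector, value)

def store(vec, value):
    memory[lsh_key(vec)].append((vec, value))

def lookup(vec):
    bucket = memory.get(lsh_key(vec), [])        # scan one bucket, not everything
    if not bucket:
        return None
    dists = [np.linalg.norm(vec - v) for v, _ in bucket]
    return bucket[int(np.argmin(dists))][1]

# Fill the memory with 100k random "facts" and query a slightly noisy copy of one.
keys = rng.normal(size=(100_000, dim))
for i, key in enumerate(keys):
    store(key, i)
print(lookup(keys[42] + 0.01 * rng.normal(size=dim)))   # most likely prints 42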

Maybe we will see mobile phones with 1 TB of memory sometime.

When information is already lost

In speech recognition we frequently deal with noisy or simply corrupted recordings. For example, on call center recordings you still get error rates like 50% or 60% even with the best algorithms. Someone calls while driving, others call from a noisy street, some use really bad phones. And the question arises how to improve the accuracy on such recordings. People try complex algorithms and train models on GPUs for many weeks, while the answer is simple: the information in such recordings is simply lost, and you cannot decode them accurately.

Data transfer and data storage are expensive, and VoIP providers often save every cent since they all operate on very low margins. That means they often use buggy codecs or bad transmission lines, and as a result you simply get unintelligible speech. Then everyone uses cell phones, so you have multiple encoding-decoding rounds where the signal from the microphone is encoded with AMR, then re-encoded with G.729 and finally converted to MP3 for storage; you have many codecs and frequent frame drops. As a result the sound quality is sometimes very bad, and the recognition accuracy is zero.

The quality of speech is hard to measure, but there are ways to do it. The easiest way requires controlled experiments where you send data from one endpoint and measure the distortion at the other endpoint. There are also single-ended tools developed by the ITU, such as P.563 (the no-reference counterpart of PESQ), that simply take an audio file and give you a sound quality score which takes many parameters into account and estimates the quality of the speech. They are rough, but they can still give you some idea of how noisy your audio is. And if it is really noisy, the way to improve things is not to apply better algorithms and put more research into speech recognition, but simply to go to the VoIP provider and demand better sound quality.

Given such a tool, we might want to introduce a normalized word error rate which takes into account how good the recording is: you really want to decode high-quality recordings accurately, and you probably do not care much about bad-quality recordings. A sketch of such a metric is given below.
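This is only one possible weighting scheme, and the quality scores here are placeholder numbers in [0, 1]; in practice they would come from a P.563-style estimator.

def quality_weighted_wer(utterances):
    """utterances: list of (n_errors, n_words, quality) tuples, quality in [0, 1]."""
    weighted_errors = sum(q * e for e, n, q in utterances)
    weighted_words = sum(q * n for e, n, q in utterances)
    return weighted_errors / max(weighted_words, 1e-9)

# The clean call dominates the score; the hopeless VoIP call barely counts.
print(quality_weighted_wer([(2, 100, 0.9),      # clean recording, 2% WER
                            (55, 100, 0.1)]))   # clipped, re-encoded VoIP call
# 0.073, versus a plain WER of (2 + 55) / 200 = 0.285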

When accuracy matters, sound quality is really important. If possible, you can use your own VoIP stack and send audio directly from the mobile microphone to the server. But when phone calls come into play, it is usually hopeless.


Goodbye Google+

Dear friends, as you know, Google+ is shutting down. I considered several alternatives: Facebook, Quora, LinkedIn, my old blog, Reddit, Twitter, Telegram. Unfortunately there are things I dislike in all of them.

In the age of big data we can certainly confirm that big data is just big trash. The real ideas are always hidden from public discussion under layers of disinformation. Incognito technologies are the ones that are going to rule in the next technological age. The platform to support hidden data is not going to be public anyway.

Finally, my own stupid content is nothing but ramblings; I just relay the ideas I see around the web. It is way more interesting to read about my friends and colleagues. Not just technology news, but tiny personal things, even opinionated ones, are of great importance and interest. It is very sad to lose such a place as Google+ was. Ideally it would be nice to meet and discuss in person, but that is not always easy.

So it would certainly be great if you joined a few of the sites below and shared your ideas and knowledge; I hope it will be beneficial for all of us.

https://t.me/speech_recognition - the new Telegram channel about speech recognition. I like Telegram for the simplicity and speed of the UI, for the elegance of its technical solutions and for its extension capabilities. I am also quite excited about the Russian origins of Telegram, supposedly the product of Russian intelligence agencies.

https://www.quora.com/q/usejrrgnezvhiyup - the space on Quora. I find some content on Quora quite offensive, but I also find many extremely interesting answers there from really nice people. I also find that Quora is very helpful for establishing new connections and promoting ideas.

https://www.linkedin.com/groups/8614109/ - over the years I have found LinkedIn extremely useful for business and for establishing connections. They screwed up group discussions, they screwed up the UI, they lost many opportunities, but they still remain the top business network. Hopefully they will catch up with the issues they have. I hope the group is going to be a useful channel to get in touch and learn more about recent developments.

I also hope to continue with the blog, update our company website and continue with development. More on that in the future.

Please let me know what your opinion is.

The theory of possibilities

I have got quite interested in predicting the future these days. One nice idea from the Russian writer Sergey Borisovich Pereslegin is that we should build the future based on the theory of possibilities rather than the theory of probabilities. This is actually a very deep idea.

Probability theory is very common these days, and everyone is applying Bayesian methods here and there. But the problem with probability theory is that it can only predict probable things which are known or have been observed before.

The theory of possibilities can discover new unknown things.

Surprisingly, this is quite a well-researched subject; for example, one can check
Possibility Theory and its Applications: Where Do We Stand? by Didier Dubois and Henri Prade

The masking problem - capsules, SpecAugment, BERT

An important issue with modern neural networks is their vulnerability to masked corruption, that is, random corruption of a small number of samples in an image or sound. It is well known that humans are very robust to such noise: a person can ignore slightly but randomly corrupted pictures, sounds and sentences. MP3 compression uses masking to drop unimportant bits of sound. Random impulsive background noise usually has little effect on speech recognition by humans. On the other hand, it is very easy to demonstrate that modern ASR is extremely vulnerable to random spontaneous noise, and that really makes a difference: even a slight change of some frequencies can hurt the accuracy a lot.

Hinton understood this problem, and that is why he proposed capsule networks as a solution. The idea is that by using agreement between a set of experts you can get a more reliable prediction, ignoring the unreliable parts. Capsules are not very popular yet, but they were conceived exactly to solve the masking problem.

On the other hand, Google/Facebook/OpenAI tried to solve the same problem with more traditional networks. They still use deep, densely connected architectures, but they decided to corrupt the dataset with masks during training and teach the model to deal with them. And it works well too: remember the success of SpecAugment in speech recognition; BERT/RoBERTa/XLM in NLP are very good examples as well.

On the path to reproducing this idea it is important to understand one thing. Since a neural network effectively memorizes the input, to properly recognize masked images the trainer has to see all possible masks and store their vectors in the network. That means the training process has to be much, much longer and the network has to be much, much bigger. We see that in BERT, and the Kaldi people also saw it when they tried to reproduce SpecAugment.

Given that, some future ideas:

1. SpecAugment is not really random masking; it drops either a whole column or a whole row. I predict a more effective masking would be to randomly drop 15% of the values over the whole 2-D spectrum, BERT-style (see the sketch after this list). I think in the near future we shall see that idea implemented.

2. The idea of masking can be applied to other sequence modeling problems in speech, for example in TTS; we shall see it soon in vocoders and in Transformer/Tacotron models.

3. The waste of resources for training and decoding with masking is obvious; a more intelligent architecture for recognizing masked inputs might change things significantly.
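To make point 1 concrete, here is a toy comparison of the two masking styles on a random stand-in spectrogram; the shapes, widths and rates are illustrative only and not the SpecAugment recipe.

import numpy as np

rng = np.random.default_rng(0)
spec = rng.normal(size=(80, 300))            # (mel bands, frames)

def specaugment_style_mask(s, n_freq=2, f_width=10, n_time=2, t_width=30):
    s = s.copy()
    for _ in range(n_freq):                  # frequency stripes (rows)
        f0 = rng.integers(0, s.shape[0] - f_width)
        s[f0:f0 + f_width, :] = 0.0
    for _ in range(n_time):                  # time stripes (columns)
        t0 = rng.integers(0, s.shape[1] - t_width)
        s[:, t0:t0 + t_width] = 0.0
    return s

def random_cell_mask(s, rate=0.15):
    keep = rng.random(s.shape) >= rate       # drop ~15% of individual cells
    return s * keep

masked_stripes = specaugment_style_mask(spec)
masked_random = random_cell_mask(spec)
print("fraction zeroed, stripe-style:", np.mean(masked_stripes == 0.0))
print("fraction zeroed, BERT-style:  ", np.mean(masked_random == 0.0))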

Thanks to Robit Mann on @cmusphinx for the initial idea.

Information flows of the future

It is interesting how similar ideas arise here and there in seemingly unrelated contexts. A recent quote from Actionable Book Summary: The Inevitable by Kevin Kelly:
And what’s next probably looks like this: Imagine zillion streams of information interacting with each other, communicating, pulsating. A new type of computer, tracking and recording everything we do. The future will be less about owning stuff and more about being part of flowing information that will supposedly make our lives easier.
Compare it to this quote from the Secushare draft:
secushare employs GNUnet for end-to-end encryption and anonymizing mesh routing (because it has a more suitable architecture than Tor or I2P) and applies PSYC on top (because it performs better than XMPP, JSON or OStatus) to create a distributed social graph.

Selected Papers Interspeech 2019 Monday

Overall, it is going pretty well. Many very good papers, diarization is joining with decoding, everything is moving in the right direction.


RadioTalk: a large-scale corpus of talk radio transcripts 
Doug Beeferman (MIT Media Lab), William Brannon (MIT Media Lab), Deb Roy (MIT Media Lab)

Automatic lyric transcription from Karaoke vocal tracks: Resources and a Baseline System 
Gerardo Roa (University of Sheffield), Jon Barker (University of Sheffield)

Speaker Diarization with Lexical Information
Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, Shrikanth Narayanan

Full-Sentence Correlation: a Method to Handle Unpredictable Noise for Robust Speech Recognition 
Ming Ji (Queen's University Belfast), Danny Crookes (Queen's University Belfast)

Untranscribed Web Audio for Low Resource Speech Recognition
Andrea Carmantini, Peter Bell, Steve Renals 

Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data
Manasa Prasad, Daan van Esch, Sandy Ritchie, Jonas Fromseier Mortensen 

How to annotate 100 hours in 45 minutes 
Per Fallgren (KTH Royal Institute of Technology), Zofia Malisz (KTH, Stockholm), Jens Edlund (KTH Speech, Music and Hearing)

Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition 

High quality - lightweight and adaptable TTS using LPCNet 
Zvi Kons (IBM Haifa research lab), Slava Shechtman (Speech Technologies, IBM Research AI), Alexander Sorin (IBM Research - Haifa), Carmel Rabinovitz (IBM Research - Haifa), Ron Hoory (IBM Haifa Research Lab)
Very nice quality

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, Björn W. Schuller

Large-Scale Mixed-Bandwidth Deep Neural Network Acoustic Modeling for Automatic Speech Recognition 
Khoi-Nguyen Mac (University of Illinois at Urbana-Champaign), Xiaodong Cui (IBM T. J. Watson Research Center), Wei Zhang (IBM T. J. Watson Research Center), Michael Picheny (IBM T. J. Watson Research Center)

An Investigation into On-Device Personalization of End-to-End Automatic Speech Recognition Models
Khe Chai Sim, Petr Zadrazil, Françoise Beaufays

Selected Papers Interspeech 2019 Tuesday

Spatial and Spectral Fingerprint in The Brain: Speaker Identification from Single Trial MEG Signals 
Oral; 1000–1020
Debadatta Dash (The University of Texas at Dallas), Paul Ferrari (University of Texas at Austin), Jun Wang (University of Texas at Dallas)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3105.pdf

Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data 
Jason Fong (University of Edinburgh), Pilar Oplustil (University of Edinburgh), Zack Hodari (University of Edinburgh), Simon King (University of Edinburgh)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1824.pdf

Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise 
Poster; 1000–1200
Avashna Govender (The Centre for Speech Technology Research, University of Edinburgh), Anita E Wagner (Graduate School of Medical Sciences, School of Behavioural and Cognitive Neurosciences, University of Groningen), Simon King (University of Edinburgh)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1783.pdf

Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice 
Poster; 1000–1200
Vikramjit Mitra (Apple Inc.), Sue Booker (Apple Inc.), Erik Marchi (Apple Inc), David Scott Farrar (Apple Inc.), Ute Dorothea Peitz (Apple Inc.), Bridget Cheng (Apple Inc.), Ermine Teves (Apple Inc.), Anuj Mehta (Apple Inc.), Devang Naik (Apple)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2998.pdf

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment 
Chitralekha Gupta (National University of Singapore), Emre Yilmaz (National University of Singapore), Haizhou Li (National University of Singapore)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1520.pdf

STC Antispoofing Systems for the ASVspoof2019 Challenge 
Galina Lavrentyeva (ITMO University, Speech Technology Center), Sergey Novoselov (ITMO University, Speech Technology Center), Tseren Andzhukaev (Speech Technology Center), Marina Volkova (Speech Technology Center), Artem Gorlanov (Speech Technology Center), Alexandr Kozlov (Speech Technology Center Ltd.)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1768.pdf

Developing Pronunciation Models in New Languages Faster by Exploiting Common Grapheme-to-Phoneme Correspondences Across Languages 
Harry Bleyan (Google), Sandy Ritchie (Google), Jonas Fromseier Mortensen (Google), Daan van Esch (Google)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1781.pdf

Multilingual Speech Recognition with Corpus Relatedness Sampling
Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze 
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3052.pdf

On the Use/Misuse of the Term 'Phoneme' 
Roger Moore (University of Sheffield), Lucy Skidmore (University of Sheffield)

https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2711.pdf

Selected Papers Interspeech 2019 Wednesday

A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition
Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2700.pdf
Cool merge graphs

Detection and Recovery of OOVs for Improved English Broadcast News Captioning
Samuel Thomas (IBM Research AI), Kartik Audhkhasi (IBM Research AI), Zoltan Tuske (IBM Research AI), Yinghui Huang (IBM Research AI), Michael Picheny (IBM Research AI)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2793.pdf
Nothing new but still important

Disfluencies and Human Speech Transcription Errors
Vicky Zayats (University of Washington), Trang Tran (University of Washington), Courtney Mansfield (University of Washington), Richard Wright (University of Washington), Mari Ostendorf (University of Washington)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3134.pdf

Robust Sound Recognition: A Neuromorphic Approach
Jibin Wu (National University of Singapore), Zihan Pan , Malu Zhang , Rohan Kumar Das , Yansong Chua , Haizhou Li
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/8032.pdf
Spiking neural networks

Neural Named Entity Recognition from Subword Units
Abdalghani Abujabal (Max Planck Institute for Informatics), Judith Gaspers (Amazon)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1305.pdf
Name recognition is still important

Unsupervised Acoustic Segmentation and Clustering using Siamese Network Embeddings
Saurabhchand Bhati (The Johns Hopkins University), Shekhar Nayak (Indian Institute of Technology Hyderabad), Sri Rama Murty Kodukula (IIT Hyderabad), Najim Dehak (Johns Hopkins University)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2981.pdf

Acoustic Model Bootstrapping Using Semi-Supervised Learning
Langzhou Chen (Amazon Cambridge office), Volker Leutnant (Amazon Aachen office)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2818.pdf

Bandwidth Embeddings for Mixed-bandwidth Speech Recognition
Gautam Mantena (Apple Inc.), Ozlem Kalinli (Apple Inc), Ossama Abdel-Hamid (Apple Inc), Don McAllaster (Apple Inc)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2589.pdf

Towards Debugging Deep Neural Networks by Generating Speech Utterances
Bilal Soomro (University of Eastern Finland), Anssi Kanervisto (University of Eastern Finland), Trung Ngo Trong (University of Eastern Finland), Ville Hautamaki (University of Eastern Finland)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2339.pdf
Debugging is a very nice idea

A Study for Improving Device-Directed Speech Detection toward Frictionless Human-Machine Interaction
Che-Wei Huang (Amazon), Roland Maas (Amazon.com), Sri Harish Mallidi (Amazon, USA), Bjorn Hoffmeister (Amazon.com)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2840.pdf
Nice idea, we covered that before

Deep Learning for Orca Call Type Identification — A Fully Unsupervised Approach
Christian Bergler, Manuel Schmitt, Rachael Xi Cheng, Andreas Maier, Volker Barth, Elmar Nöth
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1857.pdf
Kinda cool

The STC ASR System for the VOiCES from a Distance Challenge 2019
Ivan Medennikov (STC-innovations Ltd), Yuri Khokhlov (STC-innovations Ltd), Aleksei Romanenko (ITMO University), Ivan Sorokin (STC), Anton Mitrofanov (STC-innovations Ltd), Vladimir Bataev (Speech Technology Center Ltd), Andrei Andrusenko (STC-innovations Ltd), Tatiana Prisyach (STC-innovations Ltd), Mariya Korenevskaya (STC-innovations Ltd), Oleg Petrov (ITMO University), Alexander Zatvornitskiy (Speech Technology Center)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1574.pdf
Kaggle-style with cool tricks (char-based LM), congrats to STC

Continuous Emotion Recognition in Speech – Do We Need Recurrence?
Maximilian Schmitt (ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg), Nicholas Cummins (University of Augsburg), Björn Schuller (University of Augsburg / Imperial College London)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2710.pdf

Self-supervised speaker embeddings
Themos Stafylakis (Omilia - Conversational Intelligence), Johan Rohdin (Brno University of Technology), Oldrich Plchot (Brno University of Technology), Petr Mizera (Czech Technical University in Prague), Lukas Burget (Brno University of Technology)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2842.pdf
the word of the year

Better morphology prediction for better speech systems
Dravyansh Sharma (Carnegie Mellon University), Melissa Wilson (Google LLC), Antoine Bruguier (Google LLC)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3207.pdf

Connecting and Comparing Language Model Interpolation Techniques
Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1822.pdf
Worth a reminder

Articulation rate as a metric in spoken language assessment
Calbert Graham (University of Cambridge), Francis Nolan (University of Cambridge)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2098.pdf