Channel: Speech Recognition With Vosk

Spectre and deep learning

I noticed a big slowdown in RELU layer performance recently; the RELU operation can now account for up to 10% of total CPU time. This is with kernel 4.15. On older machines everything is just fine.

RELU is a computation of max(x, 0) over a vector of floats, so I suspect a Spectre patch, which significantly slows down CPU branch prediction. Who could have thought of that.

The solution seems to be:

diff --git a/src/matrix/kaldi-matrix.cc b/src/matrix/kaldi-matrix.cc
index faf23cdf0..3ef686310 100644
--- a/src/matrix/kaldi-matrix.cc
+++ b/src/matrix/kaldi-matrix.cc
@@ -2164,8 +2164,10 @@ void MatrixBase<Real>::Floor(const MatrixBase<Real> &src, Real floor_val) {
   const Real *src_row_data = src.Data();
   for (MatrixIndexT row = 0; row < num_rows;
        row++,row_data += stride_, src_row_data += src.stride_) {
-    for (MatrixIndexT col = 0; col < num_cols; col++)
-      row_data[col] = (src_row_data[col] < floor_val ? floor_val : src_row_data[col]);
+    for (MatrixIndexT col = 0; col < num_cols; col++) {
+      Real diff = src_row_data[col] - floor_val;
+      row_data[col] = (src_row_data[col] + floor_val + std::abs(diff)) * 0.5;
+    }
   }
 }
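
For the curious, here is a minimal standalone sketch (mine, not part of the patch) showing the same branchless identity, max(x, f) = (x + f + |x - f|) / 2, outside of Kaldi:

// Standalone sketch of the branchless floor identity used in the patch:
// max(x, f) == (x + f + |x - f|) / 2, which avoids the data-dependent
// branch that Spectre mitigations make expensive.
#include <cmath>
#include <cstdio>

static float floor_branchy(float x, float floor_val) {
  return x < floor_val ? floor_val : x;  // data-dependent branch
}

static float floor_branchless(float x, float floor_val) {
  float diff = x - floor_val;
  return (x + floor_val + std::fabs(diff)) * 0.5f;  // no branch
}

int main() {
  const float xs[] = {-2.0f, -0.5f, 0.0f, 0.5f, 2.0f};
  for (float x : xs)
    std::printf("%5.2f -> branchy %5.2f, branchless %5.2f\n",
                x, floor_branchy(x, 0.0f), floor_branchless(x, 0.0f));
  return 0;
}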




Learning with huge memory

Recently a set of papers was published about "memorization" in neural networks. For example:

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

also

Understanding deep learning requires rethinking generalization

It seems that large-memory systems have a point: you don't need millions of computing cores in a CPU, which is too power-expensive anyway; you could just go with very large memory and a reasonable number of cores that access the memory with hashing (think of Shazam, or randlm, or G2P by analogy). You probably do not need heavy parameter tying either. A toy sketch of the lookup idea follows below.
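
As an illustration only (a sketch under my own assumptions; the names here are invented, not taken from the papers above), hash-based memory lookup could look like this:

// Toy sketch of hash-based pattern memory: feature vectors are hashed
// into a large table, and new knowledge is incorporated by simply
// inserting new entries -- no retraining.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

class HashedMemory {
 public:
  // Quantize each float to a coarse grid so that nearby vectors
  // collide into the same bucket, then hash the quantized pattern.
  static size_t Key(const std::vector<float>& features) {
    size_t h = 0;
    for (float f : features) {
      long q = static_cast<long>(f * 4.0f);  // coarse quantization
      h = h * 1315423911u + std::hash<long>()(q);
    }
    return h;
  }

  void Store(const std::vector<float>& features, const std::string& label) {
    table_[Key(features)] = label;  // adding knowledge is one insert
  }

  // Returns the stored label, or an empty string if the pattern is unknown.
  std::string Lookup(const std::vector<float>& features) const {
    auto it = table_.find(Key(features));
    return it == table_.end() ? std::string() : it->second;
  }

 private:
  std::unordered_map<size_t, std::string> table_;
};

The design point is that Store is O(1) and needs no gradient updates: incorporating a new corner case is a single write, which is exactly the quick-knowledge-incorporation advantage mentioned below.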

The advantages are: you can quickly incorporate new knowledge by just putting new values into memory; you can model corner cases since they all remain accessible; and, again, you are much more energy-efficient.

Maybe we will see mobile phones with 1 TB of memory someday.

When information is already lost

In speech recognition we frequently deal with noisy or simply corrupted recordings. For example, in call center recordings you still get error rates like 50% or 60% even with the best algorithms. Someone calls while driving, others from a noisy street. Some use really bad phones. And the question arises of how to improve the accuracy on such recordings. People try different complex algorithms and train models on GPUs for many weeks, while the answer is simple: the information in such recordings is simply lost and you cannot decode them accurately.

Data transfer and data storage are expensive, and VoIP providers often save every cent since they all operate on very low margins. That means they often use codecs with bugs or bad transmission lines, and as a result you simply get unintelligible speech. Then everyone uses cell phones, so you have multiple encoding-decoding rounds where the information from the microphone is encoded with AMR, then re-encoded into G.729, and finally converted to MP3 for storage; you have many codecs and frequent frame drops. As a result the sound quality is sometimes very bad. And recognition accuracy is zero.

The quality of speech is hard to measure, but there are ways to do it. The easiest way requires controlled experiments where you send data from one endpoint and measure the distortion at the other endpoint. There are also single-ended tools developed by the ITU, such as P.563, that simply take an audio file and give you a sound quality score which takes many parameters into account and estimates the quality of the speech. They are rough, but they can still give you some idea of how noisy your audio is. And if it is really noisy, the way to improve accuracy is not to apply better algorithms and put more research into speech recognition, but simply to go to the VoIP provider and demand better sound quality.

Given such a tool, we might want to introduce a normalized word error rate which takes into account how good the recording is. You really want to decode high-quality recordings accurately, and you probably do not care as much about bad-quality recordings. One possible weighting is sketched below.
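
A sketch of what such a metric could look like (the weighting scheme here is my assumption, not an established metric): weight each utterance by its quality score, so errors on clean recordings dominate the average.

// Sketch of a quality-normalized WER: errors on high-quality recordings
// count fully, errors on bad recordings are discounted.
#include <vector>

struct Utterance {
  int errors;      // substitutions + deletions + insertions
  int ref_words;   // number of words in the reference transcript
  double quality;  // e.g. a P.563-style score rescaled to [0, 1]
};

double NormalizedWer(const std::vector<Utterance>& utts) {
  double weighted_errors = 0.0, weighted_words = 0.0;
  for (const Utterance& u : utts) {
    weighted_errors += u.quality * u.errors;
    weighted_words += u.quality * u.ref_words;
  }
  return weighted_words > 0.0 ? weighted_errors / weighted_words : 0.0;
}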

When accuracy matters, sound quality is really important. If possible you can use your own VoIP stack, sending audio directly from the mobile microphone to the server. But when phone calls come into play, it is usually hopeless.


Goodbye Google+

Dear friends, as you know, Google+ is shutting down. I considered several alternatives: Facebook, Quora, LinkedIn, my old blog, Reddit, Twitter, Telegram. Unfortunately, there are things I dislike in all of them.

In the age of big data we can certainly confirm that big data is just big trash. The real ideas are always hidden from public discussion under layers of disinformation. Incognito technologies are the ones that are going to rule in the next technological age. The platform to support hidden data is not going to be public anyway.

Finally, my own stupid content is nothing but ramblings; I just relay the ideas I see around the web. It is way more interesting to read about my friends and colleagues. Not just technology news, but tiny personal things, even opinionated ones, are of great importance and interest. It is very sad to lose such a place as Google+ was. Ideally it would be nice to meet and discuss in person, but it's not always easy.

So it would certainly be great if you joined a few of the sites below and shared your ideas and knowledge; I hope it will be beneficial for all of us.

https://t.me/speech_recognition - the new Telegram channel about speech recognition. I like Telegram for the simplicity and speed of the UI, for the elegance of its technical solutions, and for its extension capabilities. I'm also quite excited by the Russian origins of Telegram, supposedly the product of Russian intelligence agencies.

https://www.quora.com/q/usejrrgnezvhiyup - the space on Quora. I find some content on Quora quite offensive, but I also find many extremely interesting answers there from really nice people. I also find that Quora is very helpful for establishing new connections and promoting ideas.

https://www.linkedin.com/groups/8614109/ - over the years I have found LinkedIn extremely useful for business and for establishing connections. They screwed up group discussions, they screwed up the UI, they lost many opportunities, but they still remain the top business network. Hopefully they will catch up on the issues they have. I hope the group is going to be a useful channel to get in touch and learn more about recent developments.

I also hope to keep up the blog, update our company website, and continue with development. More on that in the future.

Please let me know your opinion.

The theory of possibilities

I've got quite interested in future prediction these days. One nice idea by the Russian writer Sergey Borisovich Pereslegin is that we should build the future based on the theory of possibilities rather than the theory of probabilities. This is actually a very deep idea.

Probability theory is very common these days, and everyone is applying Bayesian methods here and there. But the problem with probability theory is that it can only predict probable things which are known or have been observed before.

The theory of possibilities can discover new unknown things.
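
For reference, the formal difference is compact. In standard notation from the possibility theory literature:

\Pi(A \cup B) = \max(\Pi(A), \Pi(B))                      % possibility is maxitive
P(A \cup B) = P(A) + P(B) \quad (A \cap B = \emptyset)    % probability is additive
N(A) = 1 - \Pi(A^c)                                       % necessity, the dual measure

Because possibility is maxitive rather than additive, many mutually exclusive alternatives can all be fully possible at once; nothing forces a fixed mass of belief to be split between them, which is what lets the framework represent genuinely new, never-observed options.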

Surprisingly, this is quite a well-researched subject; for example, one can check "Possibility Theory and its Applications: Where Do We Stand?" by Didier Dubois and Henri Prade.

The masking problem - capsules, SpecAugment, BERT

An important issue with modern neural networks is their vulnerability to masked corruption, that is, the random corruption of some small number of samples in an image or sound. It is well known that humans are very robust to such noise: a person can ignore slightly but randomly corrupted pictures, sounds, and sentences. MP3 compression uses masking to drop unimportant bits of sound. Random impulse background noise usually has little effect on speech recognition by humans. On the other hand, it is very easy to demonstrate that modern ASR is extremely vulnerable to random spontaneous noise, and that really makes a difference: even a slight change of some frequencies can harm the accuracy a lot.

Hinton understood this problem, and that is why he proposed capsule networks as a solution. The idea is that by using agreement between a set of experts you can get a more reliable prediction, ignoring unreliable parts. Capsules are not very popular yet, but they were conceived exactly to solve the masking problem.

On the other hand, Google/Facebook/OpenAI tried to solve the same problem with more traditional networks. They still use deep and densely connected architectures, but they decided to corrupt the dataset with masks during training and teach the model to recognize them. And it does work well too: remember the SpecAugment success in speech recognition; BERT/RoBERTa/XLM in NLP are very good examples too.

On the path to reproducing this idea it is important to understand one thing. Since a neural network effectively memorizes the input, to properly recognize masked images the training process has to see all possible masks and store their vectors in the network. That means the training process has to be much, much longer and the network has to be much, much bigger. We see that in BERT. The Kaldi people also saw it when they tried to reproduce SpecAugment.

Given that, some future ideas:

1. SpecAugment is not really random masking; it drops either a whole column or a whole row. I predict more effective masking would be to randomly drop 15% of the values over the whole 2-D spectrum, BERT-style (see the sketch after this list). I think in the near future we shall see that idea implemented.

2. The idea of masking can be applied to other sequence modeling problems in speech, for example in TTS; we shall see it soon in vocoders and in transformer/tacotron models.

3. The waste of resources in training and decoding with masking is obvious; a more intelligent architecture for recognizing masked inputs might change things significantly.
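
A minimal sketch of idea 1 (my illustration, not an existing implementation):

// BERT-style masking of a spectrogram: drop a random ~15% of individual
// time-frequency bins instead of whole rows or columns as in SpecAugment.
#include <random>
#include <vector>

using Spectrogram = std::vector<std::vector<float>>;  // [time][frequency]

void MaskRandomBins(Spectrogram* spec, double mask_prob = 0.15,
                    float mask_value = 0.0f, unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution drop(mask_prob);
  for (auto& frame : *spec)
    for (float& bin : frame)
      if (drop(rng)) bin = mask_value;  // corrupt ~15% of bins at random
}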

Thanks to Robit Mann on @cmusphinx for the initial idea.

Information flows of the future

It is interesting how similar ideas arise here and there in seemingly unrelated contexts. A recent quote from the Actionable Book Summary of The Inevitable by Kevin Kelly:
And what’s next probably looks like this: Imagine zillion streams of information interacting with each other, communicating, pulsating. A new type of computer, tracking and recording everything we do. The future will be less about owning stuff and more about being part of flowing information that will supposedly make our lives easier.
Compare it to this quote from the Secushare draft:
secushare employs GNUnet for end-to-end encryption and anonymizing mesh routing (because it has a more suitable architecture than Tor or I2P) and applies PSYC on top (because it performs better than XMPP, JSON or OStatus) to create a distributed social graph.

Selected Papers Interspeech 2019 Monday

Overall, it is going pretty well. Many very good papers; diarization is merging with decoding; everything is moving in the right direction.


RadioTalk: a large-scale corpus of talk radio transcripts 
Doug Beeferman (MIT Media Lab), William Brannon (MIT Media Lab), Deb Roy (MIT Media Lab)

Automatic lyric transcription from Karaoke vocal tracks: Resources and a Baseline System 
Gerardo Roa (University of Sheffield), Jon Barker (University of Sheffield)

Speaker Diarization with Lexical Information
Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, Shrikanth Narayanan

Full-Sentence Correlation: a Method to Handle Unpredictable Noise for Robust Speech Recognition 
Ming Ji (Queen's University Belfast), Danny Crookes (Queen's University Belfast)

Untranscribed Web Audio for Low Resource Speech Recognition
Andrea Carmantini, Peter Bell, Steve Renals 

Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data
Manasa Prasad, Daan van Esch, Sandy Ritchie, Jonas Fromseier Mortensen 

How to annotate 100 hours in 45 minutes 
Per Fallgren (KTH Royal Institute of Technology), Zofia Malisz (KTH, Stockholm), Jens Edlund (KTH Speech, Music and Hearing)

Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition 

High quality - lightweight and adaptable TTS using LPCNet 
Zvi Kons (IBM Haifa research lab), Slava Shechtman (Speech Technologies, IBM Research AI), Alexander Sorin (IBM Research - Haifa), Carmel Rabinovitz (IBM Research - Haifa), Ron Hoory (IBM Haifa Research Lab)
Very nice quality

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, Björn W. Schuller

Large-Scale Mixed-Bandwidth Deep Neural Network Acoustic Modeling for Automatic Speech Recognition 
Khoi-Nguyen Mac (University of Illinois at Urbana-Champaign), Xiaodong Cui (IBM T. J. Watson Research Center), Wei Zhang (IBM T. J. Watson Research Center), Michael Picheny (IBM T. J. Watson Research Center)

An Investigation into On-Device Personalization of End-to-End Automatic Speech Recognition Models
Khe Chai Sim, Petr Zadrazil, Françoise Beaufays

Selected Papers Interspeech 2019 Tuesday

Spatial and Spectral Fingerprint in The Brain: Speaker Identification from Single Trial MEG Signals 
Oral; 1000–1020
Debadatta Dash (The University of Texas at Dallas), Paul Ferrari (University of Texas at Austin), Jun Wang (University of Texas at Dallas)

Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data 
Jason Fong (University of Edinburgh), Pilar Oplustil (University of Edinburgh), Zack Hodari (University of Edinburgh), Simon King (University of Edinburgh)

Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise 
Poster; 1000–1200
Avashna Govender (The Centre for Speech Technology Research, University of Edinburgh), Anita E Wagner (Graduate School of Medical Sciences, School of Behavioural and Cognitive Neurosciences, University of Groningen), Simon King (University of Edinburgh)

Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice 
Poster; 1000–1200
Vikramjit Mitra (Apple Inc.), Sue Booker (Apple Inc.), Erik Marchi (Apple Inc), David Scott Farrar (Apple Inc.), Ute Dorothea Peitz (Apple Inc.), Bridget Cheng (Apple Inc.), Ermine Teves (Apple Inc.), Anuj Mehta (Apple Inc.), Devang Naik (Apple)

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment 
Chitralekha Gupta (National University of Singapore), Emre Yilmaz (National University of Singapore), Haizhou Li (National University of Singapore)

STC Antispoofing Systems for the ASVspoof2019 Challenge 
Galina Lavrentyeva (ITMO University, Speech Technology Center), Sergey Novoselov (ITMO University, Speech Technology Center), Tseren Andzhukaev (Speech Technology Center), Marina Volkova (Speech Technology Center), Artem Gorlanov (Speech Technology Center), Alexandr Kozlov (Speech Technology Center Ltd.)

Developing Pronunciation Models in New Languages Faster by Exploiting Common Grapheme-to-Phoneme Correspondences Across Languages 
Harry Bleyan (Google), Sandy Ritchie (Google), Jonas Fromseier Mortensen (Google), Daan van Esch (Google)

Multilingual Speech Recognition with Corpus Relatedness Sampling
Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze 

On the Use/Misuse of the Term 'Phoneme' 
Roger Moore (University of Sheffield), Lucy Skidmore (University of Sheffield)

Selected Papers Interspeech 2019 Wednesday

A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition
Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2700.pdf
Cool merge graphs

Detection and Recovery of OOVs for Improved English Broadcast News Captioning
Samuel Thomas (IBM Research AI), Kartik Audhkhasi (IBM Research AI), Zoltan Tuske (IBM Research AI), Yinghui Huang (IBM Research AI), Michael Picheny (IBM Research AI)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2793.pdf
Nothing new but still important

Disfluencies and Human Speech Transcription Errors
Vicky Zayats (University of Washington), Trang Tran (University of Washington), Courtney Mansfield (University of Washington), Richard Wright (University of Washington), Mari Ostendorf (University of Washington)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3134.pdf

Robust Sound Recognition: A Neuromorphic Approach
Jibin Wu (National University of Singapore), Zihan Pan , Malu Zhang , Rohan Kumar Das , Yansong Chua , Haizhou Li
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/8032.pdf
Spiking neural networks

Neural Named Entity Recognition from Subword Units
Abdalghani Abujabal (Max Planck Institute for Informatics), Judith Gaspers (Amazon)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1305.pdf
Name recognition is still important

Unsupervised Acoustic Segmentation and Clustering using Siamese Network Embeddings
Saurabhchand Bhati (The Johns Hopkins University), Shekhar Nayak (Indian Institute of Technology Hyderabad), Sri Rama Murty Kodukula (IIT Hyderabad), Najim Dehak (Johns Hopkins University)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2981.pdf

Acoustic Model Bootstrapping Using Semi-Supervised Learning
Langzhou Chen (Amazon Cambridge office), Volker Leutnant (Amazon Aachen office)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2818.pdf

Bandwidth Embeddings for Mixed-bandwidth Speech Recognition
Gautam Mantena (Apple Inc.), Ozlem Kalinli (Apple Inc), Ossama Abdel-Hamid (Apple Inc), Don McAllaster (Apple Inc)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2589.pdf

Towards Debugging Deep Neural Networks by Generating Speech Utterances
Bilal Soomro (University of Eastern Finland), Anssi Kanervisto (University of Eastern Finland), Trung Ngo Trong (University of Eastern Finland), Ville Hautamaki (University of Eastern Finland)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2339.pdf
Debugging is a very nice idea

A Study for Improving Device-Directed Speech Detection toward Frictionless Human-Machine Interaction
Che-Wei Huang (Amazon), Roland Maas (Amazon.com), Sri Harish Mallidi (Amazon, USA), Bjorn Hoffmeister (Amazon.com)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2840.pdf
Nice idea, we covered that before

Deep Learning for Orca Call Type Identification — A Fully Unsupervised Approach
Christian Bergler, Manuel Schmitt, Rachael Xi Cheng, Andreas Maier, Volker Barth, Elmar Nöth
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1857.pdf
Kinda cool

The STC ASR System for the VOiCES from a Distance Challenge 2019
Ivan Medennikov (STC-innovations Ltd), Yuri Khokhlov (STC-innovations Ltd), Aleksei Romanenko (ITMO University), Ivan Sorokin (STC), Anton Mitrofanov (STC-innovations Ltd), Vladimir Bataev (Speech Technology Center Ltd), Andrei Andrusenko (STC-innovations Ltd), Tatiana Prisyach (STC-innovations Ltd), Mariya Korenevskaya (STC-innovations Ltd), Oleg Petrov (ITMO University), Alexander Zatvornitskiy (Speech Technology Center)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1574.pdf
Kaggle-style with cool tricks (char-based LM), congrats to STC

Continuous Emotion Recognition in Speech – Do We Need Recurrence?
Maximilian Schmitt (ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg), Nicholas Cummins (University of Augsburg), Björn Schuller (University of Augsburg / Imperial College London)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2710.pdf

Self-supervised speaker embeddings
Themos Stafylakis (Omilia - Conversational Intelligence), Johan Rohdin (Brno University of Technology), Oldrich Plchot (Brno University of Technology), Petr Mizera (Czech Technical University in Prague), Lukas Burget (Brno University of Technology)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2842.pdf
The word of the year

Better morphology prediction for better speech systems
Dravyansh Sharma (Carnegie Mellon University), Melissa Wilson (Google LLC), Antoine Bruguier (Google LLC)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3207.pdf

Connecting and Comparing Language Model Interpolation Techniques
Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1822.pdf
Worth a reminder

Articulation rate as a metric in spoken language assessment
Calbert Graham (University of Cambridge), Francis Nolan (University of Cambridge)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2098.pdf
