Channel: Speech Recognition With Vosk

Spectre and deep learning

I noticed a big slowdown in RELU layer performance recently; the RELU operation can now account for up to 10% of total CPU time. This is with kernel 4.15. On older machines everything is just fine.

RELU is a computation of max(x, 0) over a vector of floats, so I suspect a Spectre patch, which significantly slows down CPU branch prediction. Who could have thought of that.

The solution seems to be:

diff --git a/src/matrix/kaldi-matrix.cc b/src/matrix/kaldi-matrix.cc
index faf23cdf0..3ef686310 100644
--- a/src/matrix/kaldi-matrix.cc
+++ b/src/matrix/kaldi-matrix.cc
@@ -2164,8 +2164,10 @@ void MatrixBase<Real>::Floor(const MatrixBase<Real> &src, Real floor_val) {
   const Real *src_row_data = src.Data();
   for (MatrixIndexT row = 0; row < num_rows;
        row++,row_data += stride_, src_row_data += src.stride_) {
-    for (MatrixIndexT col = 0; col < num_cols; col++)
-      row_data[col] = (src_row_data[col] < floor_val ? floor_val : src_row_data[col]);
+    for (MatrixIndexT col = 0; col < num_cols; col++) {
+      Real diff = src_row_data[col] - floor_val;
+      row_data[col] = (src_row_data[col] + floor_val + std::abs(diff)) * 0.5;
+    }
   }
 }
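
For the curious, here is a minimal standalone sketch (mine, not part of the patch) showing the same branchless identity, max(x, f) = (x + f + |x - f|) / 2, outside of Kaldi:

// Standalone sketch of the branchless floor identity used in the patch:
// max(x, f) == (x + f + |x - f|) / 2, which avoids the data-dependent
// branch that Spectre mitigations make expensive.
#include <cmath>
#include <cstdio>

static float floor_branchy(float x, float floor_val) {
  return x < floor_val ? floor_val : x;  // data-dependent branch
}

static float floor_branchless(float x, float floor_val) {
  float diff = x - floor_val;
  return (x + floor_val + std::fabs(diff)) * 0.5f;  // no branch
}

int main() {
  const float xs[] = {-2.0f, -0.5f, 0.0f, 0.5f, 2.0f};
  for (float x : xs)
    std::printf("%5.2f -> branchy %5.2f, branchless %5.2f\n",
                x, floor_branchy(x, 0.0f), floor_branchless(x, 0.0f));
  return 0;
}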




Learning with huge memory

Recently a set of papers was published about "memorization" in neural networks. For example:

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

also

Understanding deep learning requires rethinking generalization

It seems that large-memory systems have a point: you don't need millions of computing cores in a CPU, which is too power-expensive anyway; you could just go with very large memory and a reasonable number of cores that access the memory with hashing (think of Shazam, or randlm, or G2P by analogy). You probably do not need heavy parameter tying either. A toy sketch of the lookup idea follows below.
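
As an illustration only (a sketch under my own assumptions; the names here are invented, not taken from the papers above), hash-based memory lookup could look like this:

// Toy sketch of hash-based pattern memory: feature vectors are hashed
// into a large table, and new knowledge is incorporated by simply
// inserting new entries -- no retraining.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

class HashedMemory {
 public:
  // Quantize each float to a coarse grid so that nearby vectors
  // collide into the same bucket, then hash the quantized pattern.
  static size_t Key(const std::vector<float>& features) {
    size_t h = 0;
    for (float f : features) {
      long q = static_cast<long>(f * 4.0f);  // coarse quantization
      h = h * 1315423911u + std::hash<long>()(q);
    }
    return h;
  }

  void Store(const std::vector<float>& features, const std::string& label) {
    table_[Key(features)] = label;  // adding knowledge is one insert
  }

  // Returns the stored label, or an empty string if the pattern is unknown.
  std::string Lookup(const std::vector<float>& features) const {
    auto it = table_.find(Key(features));
    return it == table_.end() ? std::string() : it->second;
  }

 private:
  std::unordered_map<size_t, std::string> table_;
};

The design point is that Store is O(1) and needs no gradient updates: incorporating a new corner case is a single write, which is exactly the quick-knowledge-incorporation advantage mentioned below.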

The advantages are: you can quickly incorporate new knowledge by just putting new values into memory; you can model corner cases since they all remain accessible; and, again, you are much more energy-efficient.

Maybe we will see mobile phones with 1 TB of memory someday.

When information is already lost

In speech recognition we frequently deal with noisy or simply corrupted recordings. For example, in call center recordings you still get error rates like 50% or 60% even with the best algorithms. Someone calls while driving, others from a noisy street. Some use really bad phones. And the question arises of how to improve the accuracy on such recordings. People try different complex algorithms and train models on GPUs for many weeks, while the answer is simple: the information in such recordings is simply lost and you cannot decode them accurately.

Data transfer and data storage are expensive, and VoIP providers often save every cent since they all operate on very low margins. That means they often use codecs with bugs or bad transmission lines, and as a result you simply get unintelligible speech. Then everyone uses cell phones, so you have multiple encoding-decoding rounds where the information from the microphone is encoded with AMR, then re-encoded into G.729, and finally converted to MP3 for storage; you have many codecs and frequent frame drops. As a result the sound quality is sometimes very bad. And recognition accuracy is zero.

The quality of speech is hard to measure, but there are ways to do it. The easiest way requires controlled experiments where you send data from one endpoint and measure the distortion at the other endpoint. There are also single-ended tools developed by the ITU, such as P.563, that simply take an audio file and give you a sound quality score which takes many parameters into account and estimates the quality of the speech. They are rough, but they can still give you some idea of how noisy your audio is. And if it is really noisy, the way to improve accuracy is not to apply better algorithms and put more research into speech recognition, but simply to go to the VoIP provider and demand better sound quality.

Given such a tool, we might want to introduce a normalized word error rate which takes into account how good the recording is. You really want to decode high-quality recordings accurately, and you probably do not care as much about bad-quality recordings. One possible weighting is sketched below.
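
A sketch of what such a metric could look like (the weighting scheme here is my assumption, not an established metric): weight each utterance by its quality score, so errors on clean recordings dominate the average.

// Sketch of a quality-normalized WER: errors on high-quality recordings
// count fully, errors on bad recordings are discounted.
#include <vector>

struct Utterance {
  int errors;      // substitutions + deletions + insertions
  int ref_words;   // number of words in the reference transcript
  double quality;  // e.g. a P.563-style score rescaled to [0, 1]
};

double NormalizedWer(const std::vector<Utterance>& utts) {
  double weighted_errors = 0.0, weighted_words = 0.0;
  for (const Utterance& u : utts) {
    weighted_errors += u.quality * u.errors;
    weighted_words += u.quality * u.ref_words;
  }
  return weighted_words > 0.0 ? weighted_errors / weighted_words : 0.0;
}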

When accuracy matters, sound quality is really important. If possible you can use your own VoIP stack, sending audio directly from the mobile microphone to the server. But when phone calls come into play, it is usually hopeless.


Goodbye Google+

Dear friends, as you know, Google+ is shutting down. I considered several alternatives: Facebook, Quora, LinkedIn, my old blog, Reddit, Twitter, Telegram. Unfortunately, there are things I dislike in all of them.

In the age of big data we can certainly confirm that big data is just big trash. The real ideas are always hidden from public discussion under layers of disinformation. Incognito technologies are the ones that are going to rule in the next technological age. The platform to support hidden data is not going to be public anyway.

Finally, my own stupid content is nothing but ramblings; I just relay the ideas I see around the web. It is way more interesting to read about my friends and colleagues. Not just technology news, but tiny personal things, even opinionated ones, are of great importance and interest. It is very sad to lose such a place as Google+ was. Ideally it would be nice to meet and discuss in person, but it's not always easy.

So it would certainly be great if you joined a few of the sites below and shared your ideas and knowledge; I hope it will be beneficial for all of us.

https://t.me/speech_recognition - the new Telegram channel about speech recognition. I like Telegram for the simplicity and speed of the UI, for the elegance of its technical solutions, and for its extension capabilities. I'm also quite excited by the Russian origins of Telegram, supposedly the product of Russian intelligence agencies.

https://www.quora.com/q/usejrrgnezvhiyup - the space on Quora. I find some content on Quora quite offensive, but I also find many extremely interesting answers there from really nice people. I also find that Quora is very helpful for establishing new connections and promoting ideas.

https://www.linkedin.com/groups/8614109/ - over the years I have found LinkedIn extremely useful for business and for establishing connections. They screwed up group discussions, they screwed up the UI, they lost many opportunities, but they still remain the top business network. Hopefully they will catch up on the issues they have. I hope the group is going to be a useful channel to get in touch and learn more about recent developments.

I also hope to keep up the blog, update our company website, and continue with development. More on that in the future.

Please let me know your opinion.

The theory of possibilities

I've got quite interested in future prediction these days. One nice idea by the Russian writer Sergey Borisovich Pereslegin is that we should build the future based on the theory of possibilities rather than the theory of probabilities. This is actually a very deep idea.

Probability theory is very common these days, and everyone is applying Bayesian methods here and there. But the problem with probability theory is that it can only predict probable things which are known or have been observed before.

The theory of possibilities can discover new unknown things.
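
For reference, the formal difference is compact. In standard notation from the possibility theory literature:

\Pi(A \cup B) = \max(\Pi(A), \Pi(B))                      % possibility is maxitive
P(A \cup B) = P(A) + P(B) \quad (A \cap B = \emptyset)    % probability is additive
N(A) = 1 - \Pi(A^c)                                       % necessity, the dual measure

Because possibility is maxitive rather than additive, many mutually exclusive alternatives can all be fully possible at once; nothing forces a fixed mass of belief to be split between them, which is what lets the framework represent genuinely new, never-observed options.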

Surprisingly, this is quite a well-researched subject; for example, one can check "Possibility Theory and its Applications: Where Do We Stand?" by Didier Dubois and Henri Prade.

The masking problem - capsules, SpecAugment, BERT

An important issue with modern neural networks is their vulnerability to masked corruption, that is, the random corruption of some small number of samples in an image or sound. It is well known that humans are very robust to such noise: a person can ignore slightly but randomly corrupted pictures, sounds, and sentences. MP3 compression uses masking to drop unimportant bits of sound. Random impulse background noise usually has little effect on speech recognition by humans. On the other hand, it is very easy to demonstrate that modern ASR is extremely vulnerable to random spontaneous noise, and that really makes a difference: even a slight change of some frequencies can harm the accuracy a lot.

Hinton understood this problem, and that is why he proposed capsule networks as a solution. The idea is that by using agreement between a set of experts you can get a more reliable prediction, ignoring unreliable parts. Capsules are not very popular yet, but they were conceived exactly to solve the masking problem.

On the other hand, Google/Facebook/OpenAI tried to solve the same problem with more traditional networks. They still use deep and densely connected architectures, but they decided to corrupt the dataset with masks during training and teach the model to recognize them. And it does work well too: remember the SpecAugment success in speech recognition; BERT/RoBERTa/XLM in NLP are very good examples too.

On the path to reproducing this idea it is important to understand one thing. Since a neural network effectively memorizes the input, to properly recognize masked images the training process has to see all possible masks and store their vectors in the network. That means the training process has to be much, much longer and the network has to be much, much bigger. We see that in BERT. The Kaldi people also saw it when they tried to reproduce SpecAugment.

Given that, some future ideas:

1. SpecAugment is not really random masking; it drops either a whole column or a whole row. I predict more effective masking would be to randomly drop 15% of the values over the whole 2-D spectrum, BERT-style (see the sketch after this list). I think in the near future we shall see that idea implemented.

2. The idea of masking can be applied to other sequence modeling problems in speech, for example in TTS; we shall see it soon in vocoders and in transformer/tacotron models.

3. The waste of resources in training and decoding with masking is obvious; a more intelligent architecture for recognizing masked inputs might change things significantly.
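
A minimal sketch of idea 1 (my illustration, not an existing implementation):

// BERT-style masking of a spectrogram: drop a random ~15% of individual
// time-frequency bins instead of whole rows or columns as in SpecAugment.
#include <random>
#include <vector>

using Spectrogram = std::vector<std::vector<float>>;  // [time][frequency]

void MaskRandomBins(Spectrogram* spec, double mask_prob = 0.15,
                    float mask_value = 0.0f, unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution drop(mask_prob);
  for (auto& frame : *spec)
    for (float& bin : frame)
      if (drop(rng)) bin = mask_value;  // corrupt ~15% of bins at random
}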

Thanks to Robit Mann on @cmusphinx for the initial idea.

Information flows of the future

It is interesting how similar ideas arise here and there in seemingly unrelated contexts. A recent quote from the Actionable Book Summary of The Inevitable by Kevin Kelly:
And what’s next probably looks like this: Imagine zillion streams of information interacting with each other, communicating, pulsating. A new type of computer, tracking and recording everything we do. The future will be less about owning stuff and more about being part of flowing information that will supposedly make our lives easier.
Compare it to this quote from the Secushare draft:
secushare employs GNUnet for end-to-end encryption and anonymizing mesh routing (because it has a more suitable architecture than Tor or I2P) and applies PSYC on top (because it performs better than XMPP, JSON or OStatus) to create a distributed social graph.

Selected Papers Interspeech 2019 Monday

Overall, it is going pretty well. Many very good papers; diarization is merging with decoding; everything is moving in the right direction.


RadioTalk: a large-scale corpus of talk radio transcripts 
Doug Beeferman (MIT Media Lab), William Brannon (MIT Media Lab), Deb Roy (MIT Media Lab)

Automatic lyric transcription from Karaoke vocal tracks: Resources and a Baseline System 
Gerardo Roa (University of Sheffield), Jon Barker (University of Sheffield)

Speaker Diarization with Lexical Information
Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, Shrikanth Narayanan

Full-Sentence Correlation: a Method to Handle Unpredictable Noise for Robust Speech Recognition 
Ming Ji (Queen's University Belfast), Danny Crookes (Queen's University Belfast)

Untranscribed Web Audio for Low Resource Speech Recognition
Andrea Carmantini, Peter Bell, Steve Renals 

Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data
Manasa Prasad, Daan van Esch, Sandy Ritchie, Jonas Fromseier Mortensen 

How to annotate 100 hours in 45 minutes 
Per Fallgren (KTH Royal Institute of Technology), Zofia Malisz (KTH, Stockholm), Jens Edlund (KTH Speech, Music and Hearing)

Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition 

High quality - lightweight and adaptable TTS using LPCNet 
Zvi Kons (IBM Haifa research lab), Slava Shechtman (Speech Technologies, IBM Research AI), Alexander Sorin (IBM Research - Haifa), Carmel Rabinovitz (IBM Research - Haifa), Ron Hoory (IBM Haifa Research Lab)
Very nice quality

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, Björn W. Schuller

Large-Scale Mixed-Bandwidth Deep Neural Network Acoustic Modeling for Automatic Speech Recognition 
Khoi-Nguyen Mac (University of Illinois at Urbana-Champaign), Xiaodong Cui (IBM T. J. Watson Research Center), Wei Zhang (IBM T. J. Watson Research Center), Michael Picheny (IBM T. J. Watson Research Center)

An Investigation into On-Device Personalization of End-to-End Automatic Speech Recognition Models
Khe Chai Sim, Petr Zadrazil, Françoise Beaufays

Selected Papers Interspeech 2019 Tuesday

Spatial and Spectral Fingerprint in The Brain: Speaker Identification from Single Trial MEG Signals 
Oral; 1000–1020
Debadatta Dash (The University of Texas at Dallas), Paul Ferrari (University of Texas at Austin), Jun Wang (University of Texas at Dallas)

Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data 
Jason Fong (University of Edinburgh), Pilar Oplustil (University of Edinburgh), Zack Hodari (University of Edinburgh), Simon King (University of Edinburgh)

Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise 
Poster; 1000–1200
Avashna Govender (The Centre for Speech Technology Research, University of Edinburgh), Anita E Wagner (Graduate School of Medical Sciences, School of Behavioural and Cognitive Neurosciences, University of Groningen), Simon King (University of Edinburgh)

Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice 
Poster; 1000–1200
Vikramjit Mitra (Apple Inc.), Sue Booker (Apple Inc.), Erik Marchi (Apple Inc), David Scott Farrar (Apple Inc.), Ute Dorothea Peitz (Apple Inc.), Bridget Cheng (Apple Inc.), Ermine Teves (Apple Inc.), Anuj Mehta (Apple Inc.), Devang Naik (Apple)

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment 
Chitralekha Gupta (National University of Singapore), Emre Yilmaz (National University of Singapore), Haizhou Li (National University of Singapore)

STC Antispoofing Systems for the ASVspoof2019 Challenge 
Galina Lavrentyeva (ITMO University, Speech Technology Center), Sergey Novoselov (ITMO University, Speech Technology Center), Tseren Andzhukaev (Speech Technology Center), Marina Volkova (Speech Technology Center), Artem Gorlanov (Speech Technology Center), Alexandr Kozlov (Speech Technology Center Ltd.)

Developing Pronunciation Models in New Languages Faster by Exploiting Common Grapheme-to-Phoneme Correspondences Across Languages 
Harry Bleyan (Google), Sandy Ritchie (Google), Jonas Fromseier Mortensen (Google), Daan van Esch (Google)

Multilingual Speech Recognition with Corpus Relatedness Sampling
Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze 

On the Use/Misuse of the Term 'Phoneme' 
Roger Moore (University of Sheffield), Lucy Skidmore (University of Sheffield)

Selected Papers Interspeech 2019 Wednesday

A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition
Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2700.pdf
Cool merge graphs

Detection and Recovery of OOVs for Improved English Broadcast News Captioning
Samuel Thomas (IBM Research AI), Kartik Audhkhasi (IBM Research AI), Zoltan Tuske (IBM Research AI), Yinghui Huang (IBM Research AI), Michael Picheny (IBM Research AI)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2793.pdf
Nothing new but still important

Disfluencies and Human Speech Transcription Errors
Vicky Zayats (University of Washington), Trang Tran (University of Washington), Courtney Mansfield (University of Washington), Richard Wright (University of Washington), Mari Ostendorf (University of Washington)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3134.pdf

Robust Sound Recognition: A Neuromorphic Approach
Jibin Wu (National University of Singapore), Zihan Pan , Malu Zhang , Rohan Kumar Das , Yansong Chua , Haizhou Li
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/8032.pdf
Spiking neural networks

Neural Named Entity Recognition from Subword Units
Abdalghani Abujabal (Max Planck Institute for Informatics), Judith Gaspers (Amazon)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1305.pdf
Name recognition is still important

Unsupervised Acoustic Segmentation and Clustering using Siamese Network Embeddings
Saurabhchand Bhati (The Johns Hopkins University), Shekhar Nayak (Indian Institute of Technology Hyderabad), Sri Rama Murty Kodukula (IIT Hyderabad), Najim Dehak (Johns Hopkins University)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2981.pdf

Acoustic Model Bootstrapping Using Semi-Supervised Learning
Langzhou Chen (Amazon Cambridge office), Volker Leutnant (Amazon Aachen office)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2818.pdf

Bandwidth Embeddings for Mixed-bandwidth Speech Recognition
Gautam Mantena (Apple Inc.), Ozlem Kalinli (Apple Inc), Ossama Abdel-Hamid (Apple Inc), Don McAllaster (Apple Inc)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2589.pdf

Towards Debugging Deep Neural Networks by Generating Speech Utterances
Bilal Soomro (University of Eastern Finland), Anssi Kanervisto (University of Eastern Finland), Trung Ngo Trong (University of Eastern Finland), Ville Hautamaki (University of Eastern Finland)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2339.pdf
Debugging is a very nice idea

A Study for Improving Device-Directed Speech Detection toward Frictionless Human-Machine Interaction
Che-Wei Huang (Amazon), Roland Maas (Amazon.com), Sri Harish Mallidi (Amazon, USA), Bjorn Hoffmeister (Amazon.com)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2840.pdf
Nice idea, we covered that before

Deep Learning for Orca Call Type Identification — A Fully Unsupervised Approach
Christian Bergler, Manuel Schmitt, Rachael Xi Cheng, Andreas Maier, Volker Barth, Elmar Nöth
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1857.pdf
Kinda cool

The STC ASR System for the VOiCES from a Distance Challenge 2019
Ivan Medennikov (STC-innovations Ltd), Yuri Khokhlov (STC-innovations Ltd), Aleksei Romanenko (ITMO University), Ivan Sorokin (STC), Anton Mitrofanov (STC-innovations Ltd), Vladimir Bataev (Speech Technology Center Ltd), Andrei Andrusenko (STC-innovations Ltd), Tatiana Prisyach (STC-innovations Ltd), Mariya Korenevskaya (STC-innovations Ltd), Oleg Petrov (ITMO University), Alexander Zatvornitskiy (Speech Technology Center)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1574.pdf
Kaggle-style with cool tricks (char-based LM), congrats to STC

Continuous Emotion Recognition in Speech – Do We Need Recurrence?
Maximilian Schmitt (ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg), Nicholas Cummins (University of Augsburg), Björn Schuller (University of Augsburg / Imperial College London)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2710.pdf

Self-supervised speaker embeddings
Themos Stafylakis (Omilia - Conversational Intelligence), Johan Rohdin (Brno University of Technology), Oldrich Plchot (Brno University of Technology), Petr Mizera (Czech Technical University in Prague), Lukas Burget (Brno University of Technology)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2842.pdf
The word of the year

Better morphology prediction for better speech systems
Dravyansh Sharma (Carnegie Mellon University), Melissa Wilson (Google LLC), Antoine Bruguier (Google LLC)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3207.pdf

Connecting and Comparing Language Model Interpolation Techniques
Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1822.pdf
Worth a reminder

Articulation rate as a metric in spoken language assessment
Calbert Graham (University of Cambridge), Francis Nolan (University of Cambridge)
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2098.pdf
