Speech recognition and synthesis are probably the most iconic aspects of speech research. While speech compression (e.g. in mobile phones) is by far the most important speech technology in terms of daily use, it is automatic speech recognition (ASR) and synthesis, or text-to-speech (TTS), that capture the imagination of the public.
When we are experimenting with speech recognition, the diagram below (taken from the book) explains the typical process that we use;
A voice activity detector (VAD) is one of those background pieces of technology that are essential to the operation of real-life systems, yet rarely given much thought. It is the VAD that tells your mobile phone when you are talking (and hence when to consume precious battery power and mobile bandwidth to encode and transmit your speech). In quiet locations and with a strong voice, VAD is very easy, but it becomes much more difficult in noise - especially multi-speaker babble - and with quiet, speech-like sounds such as whispers.
The example VAD code given in the text is;
%the noisy speech is in array nspeech
%fs is the sample rate
L=length(nspeech);
frame=0.1;          %frame size in seconds
Ws=floor(fs*frame); %length
Nf=floor(L/Ws);     %no. of frames
energy=[];
%plot the noisy speech waveform
subplot(2,1,1)
plot([0:L-1]/fs,nspeech);axis tight
xlabel('Time,s');
ylabel('Amplitude');
%divide into frames, get energy
for n=1:Nf
   seg=nspeech(1+(n-1)*Ws:n*Ws);
   energy(n)=sum(seg.^2);
end
%plot the energy
subplot(2,1,2)
bar([1:Nf]*frame,energy,'y');
A=axis;
A(2)=(Nf-1)*frame;
axis(A)
xlabel('Time,s');
ylabel('Energy');
%find the maximum energy, and threshold
emax=max(energy);
emin=min(energy);
e10=emin+0.1*(emax-emin);
%draw the threshold on the graph
line([0 Nf-1]*frame,[e10 e10])
%plot the decision (frames > 10%)
hold on
plot([1:Nf]*frame,(energy>e10)*(emax),'ro')
hold off
The result of this code will be something like the plot below (in this case the speech is Winston Churchill - as explained in the book - and the noise is something a lot more modern);
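To actually make use of the frame decisions, a short sketch like the following (my own addition, not from the book) mutes the frames classified as non-speech. It assumes the nspeech, fs, Ws, Nf, energy and e10 variables are still in the workspace from the code above;

%sketch: apply the frame-level VAD decision back to the samples
%(assumes nspeech, fs, Ws, Nf, energy and e10 from the code above)
vspeech=zeros(size(nspeech));     %output with non-speech frames muted
for n=1:Nf
   idx=1+(n-1)*Ws:n*Ws;           %sample indices of frame n
   if energy(n)>e10               %frame exceeds the 10% threshold
      vspeech(idx)=nspeech(idx);  %keep speech frames unchanged
   end
end
%soundsc(vspeech,fs);             %listen to the result if you wish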
We will now follow the hidden Markov model (HMM) examples given in the book from page 313 onwards.
First the setup phase.
Pi=[0.7, 0.1, 0.2];     %initial state probabilities
B=[0.1, 0.02, 0.6];     %observation probability for each state
A=[0.5  0.2  0.3
   0.15 0.6  0.25
   0.1  0.4  0.5];      %state transition matrix
N=length(Pi);           %number of states
X=[0 0 0 0 1 0 0];      %observation sequence
T=length(X);            %number of observations
%
alpha=zeros(T,N);
%initial state
alpha(1,1:N)=B(:).*Pi(:);
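As a quick aside (my own sanity check, not part of the book's example), Pi and each row of A are probability distributions, so they should each sum to one;

sum(Pi)      %should return 1
sum(A,2)'    %each row of A should sum to 1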
Iterate through the observations;
for t=1:T-1
   for j=1:N       %loop over the N states
      alpha(t+1,j)=B(j)*sum(A(j,:)*alpha(t,:)');
   end
end
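Transcribing the loop directly into an equation (so B(j) stands in for the observation probability in state j, and the transition term is indexed A(j,i) here, whereas many texts index it the other way around), the recursion is

\alpha_{t+1}(j) = B(j) \sum_{i=1}^{N} A(j,i)\,\alpha_t(i), \qquad j = 1,\dots,N,

with the initialisation \alpha_1(j) = B(j)\,\Pi(j) from the setup phase above.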
This gives an alpha matrix as follows;
>> alpha

alpha =

    0.0700    0.0020    0.1200
    0.0071    0.0008    0.0407
    0.0016    0.0002    0.0128
    0.0005    0.0001    0.0040
    0.0001    0.0000    0.0012
    0.0000    0.0000    0.0004
    0.0000    0.0000    0.0001
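The quantity we ultimately want from the forward procedure is the total probability of the observation sequence, which is the sum across the final row of alpha. A one-line check (my own addition) is;

Pobs=sum(alpha(T,:))    %probability of the whole observation sequence

Notice how quickly the alpha values shrink towards zero: for longer observation sequences a practical implementation would rescale each row, or work with log probabilities, to avoid numerical underflow.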
Useful links: