Speech Recognition Jukebox

ECE 476 SPRING 2007

FINAL PROJECT

Matthew Robbins and Arojit Saha

May 2, 2007

Table of Contents

Introduction

High Level Software Design

Capturing the Human Voice

Butterworth Digital Filters

Control Section

Audio Playback

Logical Structure

Hardware/Software Tradeoffs

Existing Patents and Trademarks

Program and Hardware Design

            Program Design

            Hardware Design

Microphone

High-Pass Filter

Low-Pass Filter

Non-Inverting Amplifier

Integration of Hardware Components

Television Circuit

Testing and Results

Conclusion

Appendices

Appendix 1

Appendix 2

Appendix 3

Appendix 4

Appendix 5

 

Introduction

 

 

For the Final Project in ECE 476: Designing with Microcontrollers, Robbins and Saha developed a Speech Recognition Jukebox, comprised of a speech recognition system that activated a simple music player.   The speech recognition system was capable of recognizing four commands and could cycle through a simple play list of three songs.  The jukebox could turn itself on, begin play, move between tracks, and stop play all through user voice commands.

 

To implement this design, Robbins and Saha needed to combine several different hardware and software elements.  A small microphone was purchased and used to convert the human voice signal into a voltage signal.  This alternating voltage signal was amplified 1,000 times using three LM358 operational amplifiers.  Hardware frequency filters were used to limit the frequency input, and software frequency filters were used to parse the signal into different frequency regions.

 

The values of the signal in these different frequency regions helped to determine each individual word's unique digital 'fingerprint'.  The fingerprints of important words, such as commands for the music-playing element of the design, were stored into the program.  Each time a word was spoken, the fingerprint of this sample word was compared to the stored fingerprints to determine which command, if any, was spoken.

 

Recognized commands for the system are:

 

Command    Action
"ON"       Turn the music player on, play current song
"END"      Pause the music player
"SOON"     Play the next song
"PREV"     Play the previous song

 

Table 1: Voice Commands Recognized by the System

 

Given the correct combination of commands, a simple music tune would be played on the speaker of the television.  A more in-depth analysis of the workings of both the software and hardware sections of the design can be found below.


 

High Level Software Design

 

Speech recognition systems have been implemented in a variety of different applications, most notably automated caller systems and security systems.  These systems have progressed considerably in recent years and have the capability of performing numerous tasks from simple user vocal commands.  For the ECE 476: Designing with Microcontrollers Final Project, Robbins and Saha's ambition was to combine speech recognition technology with music playback.  Robbins and Saha were inspired by the work of previous years' groups, cited in Appendix 5, which demonstrated that such a project was realizable within the timing and hardware constraints of the ECE 476 Final Project parameters.

 

 

Capturing the Human Voice

 

The human hearing system is capable of perceiving sound over a very wide frequency spectrum, from 20 Hz on the low frequency end to upwards of 20,000 Hz on the high frequency end.  The human voice, however, does not have this kind of range.  Typical frequencies for the human voice are on the order of 100 Hz to 2,000 Hz.  Robbins and Saha used hardware filters that pass only the frequencies between approximately 150 Hz and 1,500 Hz, and several digital Butterworth filters that divide this frequency spectrum into smaller regions.  Both types of filters are discussed in more depth below.

 

But how often should one sample a signal that is oscillating at these frequencies?  According to the Nyquist sampling theorem, the sampling rate should be at least twice the highest frequency of the signal, to ensure that at least 2 samples are taken per signal period.  With voice frequencies of up to about 2,000 Hz, the sampling rate of the program would have to be no less than 4,000 samples per second.

 

The human voice also travels as a sound wave, which compresses and decompresses the air as it moves.  As will be discussed below in the Hardware Design section, a microphone was used to convert this compression wave into an electrical signal that could be filtered, amplified, and analyzed.


 

Butterworth Digital Filters

 

The frequency spectrum of the human voice needed to be divided into several sub-intervals to allow analysis of the specific frequency spectrum of the word being spoken.  Robbins and Saha divided the frequency spectrum into seven (7) intervals using six 4-pole Butterworth band-pass filters and one 2-pole Butterworth high-pass filter.  The table below illustrates the scope of each filter:

 

Filter                 Frequency Range
Band-Pass Filter #1    150 Hz – 350 Hz
Band-Pass Filter #2    350 Hz – 600 Hz
Band-Pass Filter #3    600 Hz – 850 Hz
Band-Pass Filter #4    850 Hz – 1100 Hz
Band-Pass Filter #5    1100 Hz – 1350 Hz
Band-Pass Filter #6    1350 Hz – 1600 Hz
High-Pass Filter       above 1600 Hz

 

Table 2: Frequency Range of Digital Filters

 

The Butterworth filter is designed to have as flat a response as possible in the pass band, passing the input with a gain as close to unity as possible.  In the program design, the Butterworth filters separated the A/D converter output into the frequency bands listed above.  The code for both the high-pass Butterworth filter and the band-pass Butterworth filter was written by Bruce Land and can be found on the ECE 476 course website.  The band-pass Butterworth equation is as follows:

 

 

Equation 1: Band-Pass Butterworth Filter

 

The high pass Butterworth equation is as follows:

 

 

Equation 2: High-Pass Butterworth Filter

 

After deciding on the sub-intervals for the digital filters, Robbins and Saha wrote a MATLAB function to find the b1, a2, and a3 coefficients for all seven filters.  The coefficients were found using the butter() function in MATLAB. 
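
As an illustration only, the following Python sketch shows one common form of a 2-pole Butterworth high-pass filter written as a difference equation.  The coefficient names b1, a2, and a3 follow the naming used above; the function names and the bilinear-transform design formula are assumptions made for this example and are not taken from the course code.  The 1,600 Hz cutoff and 4,000 samples per second rate come from the design described above.

```python
import math

def highpass_coeffs(cutoff_hz, sample_rate_hz):
    """2-pole Butterworth high-pass coefficients via the bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / sample_rate_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b1 = norm                                         # numerator is b1 * [1, -2, 1]
    a2 = 2.0 * (k * k - 1.0) * norm                   # denominator is [1, a2, a3]
    a3 = (1.0 - math.sqrt(2.0) * k + k * k) * norm
    return b1, a2, a3

def highpass_filter(samples, b1, a2, a3):
    """Apply y[n] = b1*(x[n] - 2*x[n-1] + x[n-2]) - a2*y[n-1] - a3*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b1 * (x - 2.0 * x1 + x2) - a2 * y1 - a3 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

# Hypothetical check signals: a constant (0 Hz) input and an input that
# alternates sign on every sample (2,000 Hz at 4,000 samples per second).
b1, a2, a3 = highpass_coeffs(1600.0, 4000.0)
dc_out = highpass_filter([1.0] * 200, b1, a2, a3)
nyq_out = highpass_filter([(-1.0) ** n for n in range(200)], b1, a2, a3)
```

Once the start-up transient has died away, the constant input is blocked and the alternating input passes at close to unity gain, which is the behavior expected of a high-pass filter.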

 


 

Control Section

 

The output of the digital filters would help to formulate a digital 'fingerprint' that was unique for each word.  Five samples were taken from each digital filter, thus yielding 35 total samples that would comprise the digital fingerprint of each word.  The fingerprints of the dictionary words, "ON", "END", "PREV", and "SOON", were stored in the software program.  Whenever the user input a command to the system, this sample's digital fingerprint would be calculated and then compared to each of the dictionary words.

 

To compare the dictionary words with the sample, the program calculated the correlation of the two vectors.  The pair with the highest absolute value correlation was chosen as a match.  When an input command word was recognized as a dictionary word, the control section would set a series of flags that would update the state machine.  This state machine would change state on these flags being set and each state corresponded to a separate song being played. 
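
The matching step described above can be sketched in Python as follows.  This is a minimal illustration, assuming the Pearson correlation coefficient is the correlation measure; the function names are hypothetical, and the 35-element fingerprints are made-up placeholder patterns, not measured data.

```python
import math

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # a constant vector carries no shape information
    return cov / math.sqrt(var_u * var_v)

def best_match(sample, dictionary):
    """Return the word whose fingerprint has the highest absolute correlation."""
    return max(dictionary, key=lambda word: abs(correlation(sample, dictionary[word])))

# Placeholder fingerprints: 7 filters x 5 samples = 35 values per word.
dictionary = {
    "ON":   [i % 7 for i in range(35)],
    "END":  [(i * i) % 11 for i in range(35)],
    "SOON": [(3 * i) % 5 for i in range(35)],
    "PREV": [i // 5 for i in range(35)],
}

# A hypothetical spoken sample: the "ON" pattern, scaled and offset.
sample = [2 * x + 1 for x in dictionary["ON"]]
matched_word = best_match(sample, dictionary)
```

Because correlation compares the shape of two vectors rather than their absolute levels, the scaled and offset sample still matches the "ON" fingerprint.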

 


 

Audio Playback

 

Robbins and Saha chose three songs to be played by the jukebox: a Sonatina written by W.A. Mozart, "Ode to Joy" written by Ludwig van Beethoven, and the Star Spangled Banner.  These songs were chosen for their simple melodies and easy recognition.  Using the audio production code provided in Lab 4: Digital Oscilloscope and the conversion values shown below, these songs' notes were converted into a format that could be played on the television speaker.

 

Note   C    D    E    F    G    A    B    C    D    E    F    G    A    B    C    Rest
Value  239  213  189  179  159  142  126  120  106  94   90   80   71   63   60   0

 

Table 3: Conversion Table for Musical Notes

            (Bold C corresponds to middle C)


 

Logical Structure

 

The logical structure of the program is quite simple.  The user speaks the desired command into the microphone.  The microphone converts this audio signal into an electrical signal, which is then filtered and amplified before being sent to the A to D converter.  The program samples the input with the A to D converter, and the converter output is run through seven digital filters.  The control section uses the outputs of the seven digital filters to obtain a working fingerprint of the spoken command and compares this fingerprint with the stored fingerprints to decide which command, if any, has been spoken.  Upon recognizing a user command, a state machine within the control section will change state.  Each state of this state machine corresponds to a separate song being activated.  Thus, upon changing state, a different song signal will be sent to the television audio connection, enabling music playback.  A simple schematic of the logical structure can be found below in Figure 1.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Figure 1: Logical Structure of Speech Recognition Jukebox

 


 

Hardware/Software Tradeoffs

 

To execute all the commands in the program, there must be enough clock cycles between samples.  The Mega32 clock runs at 16 MHz (16 million clock cycles per second).  As the code requires that the A to D converter be sampled at a rate of 4 kHz, all the code that runs for each sample must execute within 4,000 clock cycles (16 MHz / 4 kHz).  The hardware must also work in real time and not further limit the capabilities of the program.  As the hardware consists mostly of resistors and capacitors, and the LM358 is a relatively fast op-amp, there are no concerns with regard to hardware affecting the software.

 

The only remaining constraint is that all the computations performed by the program fit in 4,000 clock cycles.  The seven digital filters will consume the majority of the clock cycles.  Each 4-pole band-pass Butterworth filter takes up 228 clock cycles and the 2-pole high-pass Butterworth filter takes up 148 cycles.  Thus, all the filters together will consume 1,516 cycles (6 × 228 + 148).  This leaves almost 2,500 clock cycles for the remainder of the code, which is more than enough.
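
The cycle budget above can be checked with a few lines of Python.  The variable names are chosen for this example only; the numbers are those quoted in the text.

```python
CLOCK_HZ = 16_000_000        # Mega32 clock rate
SAMPLE_RATE_HZ = 4_000       # A to D sampling rate

BANDPASS_CYCLES = 228        # per 4-pole band-pass filter
HIGHPASS_CYCLES = 148        # for the 2-pole high-pass filter
NUM_BANDPASS = 6

# Clock cycles available between two consecutive samples.
cycles_per_sample = CLOCK_HZ // SAMPLE_RATE_HZ

# Cycles consumed by all seven filters, and what is left for other code.
filter_cycles = NUM_BANDPASS * BANDPASS_CYCLES + HIGHPASS_CYCLES
remaining_cycles = cycles_per_sample - filter_cycles

print(cycles_per_sample, filter_cycles, remaining_cycles)
```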

 


 

Existing Patents and Trademarks

 

Several phone and technology companies, notably AT&T and Microsoft, have patented speech recognition technology.  Robbins and Saha do not believe that their design infringes these companies' patents, as it is a unique, novel, and non-obvious approach to speech recognition using original hardware and software design.


 

Program and Hardware Design

 

Program Design

 

The dataflow of the program begins with the output of the A/D converter.  This value is stored in the variable Atemp.  Atemp is set in the Timer/Counter 1 interrupt, which runs every 250 µs (4,000 times per second).  Atemp is then passed to the seven digital Butterworth filters using a function called setfilters(), which is also run in the interrupt.  After the filters have been set, the program enters the player() function, which contains the state machine that runs the voice recognition section of the program.

 

The player() function is broken up into six states:  TAKE, WAIT1, ON, END, AFTER, LAST.  The TAKE state is considered to be the off state of the jukebox.  When button 7 is pressed on the STK500 board, the player turns on.  The user will have to press button 6 to use the voice recognition portion of the state machine.  Upon this button being pressed, the state machine is in the WAIT1 state.  In this state, the state machine is waiting for the user to say the word "ON."  This signals to the state machine that the user wishes to start the player.  After the user says "ON," the state machine enters the ON state and begins playing song 1 ("Ode to Joy").

 

Once in the ON state, the voice recognition state machine has four possible routes.  If the user says "SOON," the state machine assumes the user wants to play the next song (song 2).  If the user says "PREV," the state machine assumes the user wants to play the previous song (song 3).  The user can also say "END," indicating the user wants to pause the playback of the song.  Based on whether the user says "SOON", "PREV", or "END", the player state machine enters the AFTER, LAST, or END states, respectively.

 

In the AFTER state, the state machine plays song 2.  If the user says "SOON", the state machine enters the LAST state and plays song 3.  If the user says "PREV", the state machine enters the ON state and plays song 1.  In the LAST state, the state machine plays song 3.  If the user says "SOON", the state machine enters the ON state and plays song 1.  If the user says "PREV", the state machine enters the AFTER state and plays song 2.  If at any time button 7 is pressed, the state machine goes back to the TAKE state and the player is turned off.  A diagram of this state machine is found below.

 

 

 

 

 

 

 

 

 

 


Figure 2: Diagram of player() state machine
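
The transitions described for the player() state machine can also be written as a lookup table.  The Python sketch below is an illustration only, not the project code: the event names BUTTON6 and BUTTON7 and the helper next_state() are hypothetical, the TAKE to WAIT1 step is simplified to a single button 6 press, and because the text does not describe how the END state is left, no transitions out of END are modeled apart from button 7.

```python
# (current state, recognized word or button event) -> next state
TRANSITIONS = {
    ("TAKE", "BUTTON6"): "WAIT1",   # simplification: player assumed already on
    ("WAIT1", "ON"): "ON",
    ("ON", "SOON"): "AFTER",
    ("ON", "PREV"): "LAST",
    ("ON", "END"): "END",
    ("AFTER", "SOON"): "LAST",
    ("AFTER", "PREV"): "ON",
    ("LAST", "SOON"): "ON",
    ("LAST", "PREV"): "AFTER",
}

# Song played in each playing state, as described in the text.
SONG_FOR_STATE = {"ON": 1, "AFTER": 2, "LAST": 3}

def next_state(state, event):
    """Return the next state; unrecognized events leave the state unchanged."""
    if event == "BUTTON7" and state != "TAKE":
        return "TAKE"               # button 7 turns the player off from any state
    return TRANSITIONS.get((state, event), state)

# Walk through a hypothetical sequence of events.
state = "TAKE"
visited = []
for event in ["BUTTON6", "ON", "SOON", "SOON", "SOON", "PREV"]:
    state = next_state(state, event)
    visited.append(state)
```

Keeping the transitions in a table makes it easy to compare the code against the diagram one arrow at a time.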

 

In the player() state machine, the voice recognition system is always running.  The samples coming in from the Butterworth filters are compared to a set of dictionary fingerprints.  A correlation function is run to see which dictionary fingerprint most closely corresponds to the sample.  Whichever dictionary fingerprint produces the highest absolute correlation (closest to 1) is most similar to the word being spoken by the user.  This section involved the most debugging of the program.  Initially, the user was asked to input the dictionary definitions at the start of the player() state machine.

 

However, every sample is different, and consistency could not be ensured every time the program was run.  For this reason, Robbins and Saha created a separate program that saves words and outputs them to the Hyperterm terminal.  This program was used to create dictionary fingerprints and to store them in SRAM.  Robbins and Saha took two samples for every dictionary word.  The inspiration for this idea came from the Voice Controlled Car from the Spring 2006 semester of ECE 476, whose code is referenced in the Appendices.

 

Another problem with the system that required considerable debugging was that initially Robbins and Saha used Euclidean distances to relate samples to dictionary fingerprints.  However, this approach was fairly inconsistent and did not work often enough to be useful.  This inconsistency was due to the variation between samples.  While looking through the projects from the Spring 2006 semester of ECE 476, Robbins and Saha saw that the Voice Recognition Security System used correlation to relate samples to dictionary fingerprints and had an increase in recognition rate.  This group's code is also referenced in the Appendices.

 

Based on their design, Robbins and Saha decided to try correlation and also saw an increase in recognition rate.  The improvement was confirmed by outputting the state of the player() state machine to the Hyperterm terminal after each sample was spoken.  Robbins and Saha also found that certain words were recognized more reliably than others.  Several words were tried before deciding on the final list, including "NEXT", "AFTER", "STOP", and "PAUSE".
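
One possible reason correlation is more tolerant of sample-to-sample variation than Euclidean distance is that it ignores overall loudness.  The Python sketch below illustrates this under the assumption that the variation is a simple amplitude scaling; the vectors are made-up values, not recorded fingerprints, and the function names are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

reference = [1, 4, 2, 8, 5, 7, 3]        # stored fingerprint (made up)
louder = [3 * x for x in reference]      # same word, spoken three times louder
other = [5, 3, 6, 4, 2, 6, 4]            # a different word at a similar level

# Euclidean distance ranks the louder copy as the worse match...
dist_louder = euclidean_distance(reference, louder)
dist_other = euclidean_distance(reference, other)

# ...while correlation ranks it as the better match.
corr_louder = correlation(reference, louder)
corr_other = correlation(reference, other)
```

In this example the louder repetition of the same pattern lies farther from the reference than the unrelated pattern does, yet its correlation with the reference is much higher.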

 


 

Hardware Design

 

As mentioned above in the High Level Software Design section, the human voice is comprised of numerous different frequencies emitted as a compression wave through the air.  In order to perform analysis on a vocal sample, this compression wave would need to be transformed into an electrical signal using a microphone.  The electrical output of the microphone was filtered and amplified several times in order to produce a clean and responsive voltage signal.  Each of the separate hardware components used to perform these tasks is discussed individually below, followed by a discussion of each section's integration and specific design choices made by Robbins and Saha.

 

Microphone Circuit

 

Microphone

 

To convert the human voice compression wave to a voltage signal, Robbins and Saha purchased a small microphone (Part# 423-1027-ND) from the Digi-Key Corporation.  This microphone's ground and output connections needed to be soldered to the white board and the output was then filtered using a high-pass filter.  As can be seen on the data sheet, this specific microphone had an operating frequency range of 300 Hz to 6,000 Hz, which is an appropriate frequency range for measurin