VeriVoice’s Future: Voice Print Security

E-Speech: Digitizing Pronunciation Skills

These articles by Barbara Fox and Michele Alperin were prepared

for the March 27, 2002 edition of U.S. 1 Newspaper. All rights

reserved.

All Talk, Part I

It seemed like black magic in the late 1980s when

demonstrators showed the Dragon speech-to-text program: You talked

into the microphone and the computer wrote down your words. Now the

most unlikely objects (cars and television sets, for example) are

learning to talk and listen.

Phone directories, package delivery services, and manufacturers of

products for the disabled are important users of speech software.

Stockbrokers save millions annually by using speech recognition

software for automated stock quotes by telephone, and cell phones have

E-mail. Says Bill Meisel, of the California-based Speech Recognition

Update newsletter: "While venture financing has been tight in

general, speech companies are among the priorities for venture

capitalists returning to the trough."

Much of the basic research on speech software took place in New Jersey

at Bell Labs and the Institute for Defense Analyses. Yet though some

Princeton area companies are surviving in this space, none here have

struck gold, and several have completely vacated the speech arena.

Marvin Preston, a turnaround consultant with NewMarkets Inc. on
Prospect Street, thinks that entrepreneurial struggles are endemic
to the speech technology industry.

A University of Michigan alumnus (Class of 1966), Preston worked
for IBM, Ford Motor Company, and Exxon, and from 1987 to 1994 he was
the turnaround CEO of Texas-based Scott Instruments, which he
successfully sold.

"Speech will never be as accurate as people expect it to be,

because

people hear different things, depending on what they are thinking

about," says Preston. "So many claims have been made about the

prospects for speech recognition companies that it has come to be

a bad joke — but it has a future. When people stop trying to

achieve

an unachievable perfection, and concentrate on making use of what

speech recognition really can do — they can make money with

it."

In recent years seven Princeton companies have made

major investments in the three areas of speech software technology:

verification, recognition, and text-to-speech

software.

VeriVoice uses voiceprints for accurate biometric

identification. (See story below.)

E-Speech has name pronunciation software for E-commerce

and text-to-speech software for consumer devices. (See story below.)

Voxware is pursuing industrial speech applications for

warehouses. (See next story.)

Sarnoff Corporation researchers are making it easier to

talk to television sets and toaster ovens.

Siemens Corporate Research scientists have 3-D streaming

speech technology for next-generation text-to-speech applications.

IsSound was working with text-to-speech but has closed

down.

Ficomp has abandoned its speech-to-text product for

stockbrokers

and returned to its core business.

Speech recognition technology is particularly arcane because it offers

no achievable perfection, says Preston. "It tries to mimic human

hearing, which is imperfect. The ability to deal with the variability

makes it demanding for the scientist, yet you are responding to a

marketplace that thinks it needs perfection."

"It is classic," says Preston, "for people who understand

speech recognition technology to not have the entrepreneurial skills

to work with other people collaboratively to build a business. By

the time they have paid the tuition, they have lost the company. You

cannot find a speech recognition company to which this does not

apply."

— Barbara Fox

VeriVoice’s Future: Voice Print Security

Sherlock Holmes could examine a footprint in the dust,

make deductions, and, bingo, figure out the identity of a murderer.

Today’s crime detection agents use more sophisticated biometrics
(automated methods or behavioral science) to track down criminals.
But the need to

determine

identity has moved far beyond the crime scene to everyday life in

corporations and public venues of all kinds — primarily for

controlling

access to both physical and electronic space.

Over the last decade, entrepreneurs have begun marketing biometrics

that verify individual identity, ranging from fingerprints to iris
recognition. Voiceprints, which are statistical records of the unique

characteristics of the human voice, are biometrics that can be

examined

electronically to determine whether people are, literally, who they

say they are.

Five years ago, a group of former Applied Data Research employees

saw market potential in the ability to control electronic access via

voiceprints instead of passwords and plastic cards. They located

military-grade voiceprint verification technology in the United

Kingdom and at the same time interested venture capitalist John

Torkelsen

in their idea. The result was VeriVoice, a five-year-old corporation

founded by Joe Mannino, who is now CEO, and Barry Frankel, who has

since left the company. It moved from 5 Vaughn Drive, where it had

lodged with Torkelsen’s firm, Acorn Technology, to 501 Forrestal

Road, where it has eight employees.

"Security is like the weather. Everyone was talking about

security,

but no one was doing anything about it, until now. All the companies

have been around for years, but now the public is willing to accept

an added layer of security," says Mannino. Credit card companies

suffer billions of dollars in fraud annually, but rather than

aggravate

their customers to catch the criminals, they had been passing along

the costs. "With voice verification technology, they could cut

fraud by orders of magnitude. It requires no sophistication to steal

a PIN number or forge a signature, but it would take elaborate
equipment to even try to imitate a biometric."

It turns out that voiceprints are unique to individuals, in the same

way that fingerprints are. Technically, voiceprints are statistical

representations of a combination of selected human voice

characteristics.

Mannino says: "Although each competitor looks at a different

set of features, the kinds of things we all look at are pitch, mean

frequency, and the harmonics in speech." With respect to any

single

feature, a person’s voiceprint is not unique, but the combination

of features is slightly different for each person and, therefore,

distinguishable.

The uniqueness of each voiceprint derives from the physical

characteristics

that create it: the length and thickness of the voice box and the

vocal cords, the way the cords vibrate, and the effect of the physical

environment surrounding the voice box on the echoes created within

it. Gregor Havkin, vice president of research and development for

VeriVoice, likens the human voice apparatus to a handmade violin.

"The more violins differ in subtle physical characteristics,

the more they will sound different," says Havkin.

Not even the most gifted mimic can reproduce the subtle nuances

captured

in a voiceprint. "A person’s voice box is his voice box,"

observes Havkin. "It is like a bed. If we peel the sheets off,

it is still the same mattress and bedspring." So stable is this

vocal core that even when a person suffers from a cold, the voiceprint

is usually not compromised. "When we build the template,"

says Havkin, "the computer is sampling

the innate properties that the human ear takes in but cannot separate,

but the computer can."

When individuals record their voices on VeriVoice’s software, they

are asked to repeat several random sets of digits into a computer

microphone or a telephone. The software uses these utterances to

create

a voiceprint, which it keeps on file. Later, when the same individuals

seek access to a computer or to a building, they must "claim"

who they are by submitting an ID, perhaps a user name or a Social

Security number. This ID brings the "claimed"

voiceprint into current memory.

VeriVoice’s software compares this claimed voiceprint with a

"verification"

print created in real time based on one, or at most two, random

strings

of digits that the user is asked to repeat. The advantages of this

approach are twofold: the user does not need to remember a

pre-assigned

pass phrase, and there is no way for a potential intruder to collect

a phrase from the genuine user and replay it with a tape recorder

or other device.

The actual verification procedure involves a statistical comparison

between the claimed voiceprint and the verification voiceprint. For

most situations, a 99.5 percent probability is good enough, and

setting the bar higher tends to screen out some legitimate users.

There are cases, though, where 99.5 percent is not good enough: for
example, where the software is used to give brokers access to
clients’ accounts.
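The enroll, claim, and compare flow described above can be sketched in a few lines of Python. This is only an illustration, not VeriVoice's algorithm: the feature vectors, the `match_score` formula, and every function and variable name are assumptions invented for the example. Only the random digit prompt, the lookup by claimed ID, and the 99.5 percent bar come from the description.

```python
import math
import secrets

def random_digit_prompt(length=6):
    """Fresh random digits for the speaker to repeat (defeats replayed recordings)."""
    return "".join(secrets.choice("0123456789") for _ in range(length))

def match_score(enrolled, sample):
    """Toy similarity in (0, 1]; 1.0 means identical feature vectors.
    Assumption: a real system would use a proper statistical model."""
    distance = math.sqrt(sum((e - s) ** 2 for e, s in zip(enrolled, sample)))
    return math.exp(-distance)

def verify_claim(templates, claimed_id, sample, threshold=0.995):
    """Look up the claimed voiceprint and accept only if the score clears the bar."""
    enrolled = templates.get(claimed_id)
    if enrolled is None:
        return False  # nobody is enrolled under this ID
    return match_score(enrolled, sample) >= threshold

# Hypothetical enrolled template: made-up numbers standing in for
# features such as pitch, mean frequency, and harmonics.
templates = {"jsmith": [1.0, 2.0, 3.0]}
same_speaker = [1.0, 2.0, 3.004]  # small natural variation
impostor = [1.5, 2.0, 3.0]

print(random_digit_prompt())
print(verify_claim(templates, "jsmith", same_speaker))         # default bar
print(verify_claim(templates, "jsmith", same_speaker, 0.999))  # stricter bar
print(verify_claim(templates, "jsmith", impostor))
```

With these made-up numbers the genuine speaker passes at the default bar but is rejected at the stricter one, which is the trade-off the paragraph describes: raising the threshold screens out some legitimate users.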

Voiceprints offer many advantages over passwords, which can be lost,

stolen, or corrupted. "Your voice is a biological part of you, and

you can’t give it to someone else," says Havkin. Nor can someone

steal a voiceprint from you.

Other biometrics used in security applications include fingerprinting,

iris or retina scans, and handwriting analysis. Voiceprints can be

less expensive than some of the alternatives because they require

only a telephone or a computer with a microphone and a sound card.

Havkin says voiceprints provide the same level of verification

accuracy

as fingerprints read by a fingerprint reader (as distinguished from

fingerprints analyzed by human experts in criminal investigations,

which verify more accurately). Another point in favor of

voiceprinting,

he says, is that, unlike fingerprinting, it does not carry the stigma

of being associated with criminal investigations.

The primary limitation Havkin sees on the use of voiceprints for

verification

is the presence of loud voices or other loud noises that might

compromise

the voiceprint. As a result, voiceprints would not be a good choice

for verification in a crowded marketplace, in a convention hall, or

on an airport tarmac. The user interface is also important: To get

the most accurate identification, the user must keep the microphone

at a reasonably constant distance from the mouth and speak neither

too loudly nor too softly.

CEO Mannino came across biometric verification technology when

he served on the board of the Rutgers Center of Aids to Industrial

Productivity.

The son of Italian immigrants, he majored in mechanical engineering at

the University of Pennsylvania, and has worked for Applied Data

Research, Intel, and Oracle — always in leading product

development. He kept his job at Oracle until three years ago. "I’m

the guy with the arrows in his back — it keeps me intellectually

sharp," says Mannino.

Havkin graduated from Tel Aviv University in 1974 with a degree in

biology. He holds a Ph.D. in animal behavior from Dalhousie University

in Halifax and a degree in veterinary medicine from the University

of Pennsylvania. Havkin says that, before joining VeriVoice, he was

Boris Fridman’s vice president of research and development

at Nettech Systems, now called Broadbeam, and designed its initial

product offering. He joined VeriVoice in 1997.

The earliest application of voiceprint technology was in the

corrections

industry, where it is used to regulate movement of prisoners. Whereas

convicts might

easily steal plastic cards or procure them by force, voiceprints are

inseparable from the individual.

Some companies offer

verification only as application service providers. Others offer only

software that is server based. With VeriVoice’s central product, the

Software

Developers Toolkit, the

user retains complete freedom to develop applications. "We are

catering to companies that want to own the technology and

decision-making

capability," says Jordan Byk, vice president of marketing

(Carnegie Mellon, Class of 1983, Rutgers MBA). VeriVoice’s

technology is flexible enough to fit entirely on a PDA or cell phone

or to be implemented on a server, laptop, or PC.

VeriLock, a generic application that provides security for PCs, can be

embedded into users’ applications, letting them know immediately when

someone is violating a security policy.

Thus far VeriVoice’s primary focus has been applications for securing

employee computers, accessing financial services environments,

validating

time and attendance, and accessing medical records in health care

institutions and insurance companies. The last is a promising market

because the Health Insurance Portability and Accountability Act of

1996, which has its final implementation scheduled for 2003, requires

health care and insurance institutions to guarantee access to medical

records through a biometric application.

Byk says that VeriVoice expects to continue working with companies

from all sectors on system access applications. VeriVoice is also

working through vendors to develop password reset applications for

technology help desks, where 30 to 40 percent of phone calls come

from people who have forgotten their passwords. The company is also

providing verification for speech recognition companies like

SpeechWorks in Boston. Other applications well suited to voice

verification are

queuing situations, where lots of people are trying to enter the same

location at the same time, and enterprise software, where employees

need to log into multiple projects for purposes of computing time

and materials costs.

Byk puts the number of competitors to VeriVoice at 4 1/2. The

"1/2" represents one competitor that is now restructuring its

approach.

For Nuance, a large company in California, voice verification (which

verifies a speaker’s identity) is secondary to its main business in

voice recognition (which seeks to understand what is being said).

Because Nuance’s verification software is bundled with its recognizer,

a user must purchase both. The result is a higher price. Another

limitation

Byk sees on Nuance’s verification business is that it is not pursuing

the Web, but is using its product only with telephony.

Another VeriVoice competitor, Buytel, a company based in Ireland,

is entirely an application service provider (ASP); it does not sell

a product, but rather a verification service. Users must collect

enrollment and verification samples, ship them to Buytel’s server, and

wait for a verification.

Though Byk estimates the voiceprint market to be $100 to $300 million

over the next couple of years, people still consider the computer

predominantly a keyboard and mouse interface. "Even though they

are used to talking to and yelling at their computers," says Havkin,

"they haven’t made the move to using voice as the primary

interface."

Havkin expects that what is natural with telephones, interacting
verbally, will soon become the case with computers.

Just in the past year, he says, people have begun to use dictation

and voice command packages. And although most "desk side"

computers do not have built-in microphones, laptops generally do,

suggesting a "slow take" scenario in which people are moving

towards voice, but haven’t quite gotten there. Once people are ready

for voice, however, he believes that applications are just waiting

to happen. For example, he foresees replacing the elaborate

machinations

necessary to retrieve voicemail messages in some corporations with

voice verification software in each user’s phone.

VeriVoice is marketing its products through direct sales and through

value-added reseller agreements. Over the next couple of months,

VeriVoice expects to be staffing up its engineering and sales

organizations with several new hires.

In this field, one of VeriVoice’s customers (SpeechWorks) and one

of its competitors (Nuance) have reaped millions by going public.

Byk expects that VeriVoice will eventually be a good bet for a buyout,

either by a larger company that wants to consolidate multiple

biometrics into a single company or by a speech recognition company

that wants

to add voice verification to its product line. When asked where he

envisions VeriVoice five years from now, he responded, "Sold with

lots of payouts to all the employees."

— Michele Alperin

VeriVoice Inc., 501 Forrestal Road, Suite 326,

Princeton 08540. Joe Mannino, CEO. 609-452-9220; fax, 609-452-9228.

Home page: www.verivoice.com

E-Speech: Digitizing Pronunciation Skills

With the increasing use of automobile navigation

systems,

cell phones, and automated phone answering systems, Americans are

spending more time talking to computers and listening to their

responses.

Whether the content of these "conversations" is traffic

information,

stock quotes, or booking online reservations, successful communication

often depends upon accurate name pronunciation for people, places,

and businesses. This happens to be the specialty of five-year-old

E-Speech, a Princeton-based firm that produces name pronunciation

software for speech recognition and text-to-speech applications.

Marian Macchi and Dan Kahn, former Bell Labs researchers, founded

E-Speech in 1997. Having made a royalty deal with Telcordia (the

Bellcore

spinout), they took the core software with them and have since done

additional development.

"I’d say the Bell technology they walked away with is very robust

and very sound," says Marvin Preston, the former CEO of a speech

technology firm. "It is terribly resource demanding. But if you

are embedding it in a phone system, it makes a lot of sense because

you are running a lot of calls through it."

Because Macchi and Kahn left Telcordia with a product and

long-established

customers, they have been able to proceed without outside funding,

yet they have not had significant growth. Macchi is still working

from her home on Cherry Hill Road.

Macchi says that one venue where name pronunciation is critical is

in call centers, where it can mean the difference between potential

customers or voters listening to the agents or hanging up in their

faces. E-Speech can provide these call agents with either a phonetic

transcription of each name in their database or the ability to click

on a name and listen to its pronunciation. "Call centers are a

market we’re trying to develop — in particular, helping agents

pronounce people’s names," says Macchi, noting that the company

has already completed one such contract for a political party.

E-Speech’s software also supports most speech recognition systems. In most of these

systems,

the computer listens to an audio signal; figures out what phonemes

(phonetic speech sounds) it comprises; and then looks into a

dictionary

database to find out what words have been spoken. But these

dictionaries

do not usually include entries for the names of people, places, and

businesses, and the quantity and demographic variability of names

militate against the creation of a static dictionary.

"In the United States, there are more than 2 million unique

surnames," Macchi explains. The repository of company names is

also changing, she says, reinforcing her point by reading from a list

of new names from current IPO filings: Kyphon and Anteon and Altiris.

"A person can’t just sit and write transcriptions of every

existing

name," says Macchi. "That would mean an impossibly large

amount of hand labor, and hand transcription of weird phonetic symbols

would mean lots of mistakes."

"We could never keep up with this stuff, so

we’ve figured out phonics rules to pronounce words and names."

E-Speech’s software includes about 1,500 phonics rules essential to

the correct pronunciation of names. One rule, for example, is that
"when a ‘th’ comes at the beginning of a word, it is pronounced
‘th’ as in ‘thick.’" Any words in which a "th" at the

beginning of a word is not pronounced this way — for example,

the name "Thomas" and words like "the,"

"then,"

and "therefore" — would go into a dictionary of

exceptions.

(E-Speech’s exceptions dictionary includes several thousand words

and names.) The software also takes into account the ethnic origin

of a name to ensure correct pronunciation; for example, it might

identify

a French name from an "eaux" ending or a Japanese one from

a "fu," "ku," or "wa" ending.
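A general rule backed by an exceptions dictionary, plus a guess at a name's origin from its ending, might look roughly like the Python sketch below. It is a minimal illustration of the idea, not E-Speech's software: the sound labels ("TH", "DH", "T"), the function names, and the dictionary layout are all assumptions. Only the "th" rule, the example exceptions, and the "eaux", "fu", "ku", and "wa" endings come from the article.

```python
# Exceptions dictionary: words whose initial "th" breaks the general rule.
# Sound labels are illustrative, not E-Speech's notation.
TH_EXCEPTIONS = {
    "thomas": "T",      # plain "t"
    "the": "DH",        # voiced, as in "this"
    "then": "DH",
    "therefore": "DH",
}

def initial_th_sound(word):
    """Return the sound of a word-initial 'th', or None if the word has none."""
    key = word.lower()
    if not key.startswith("th"):
        return None
    # Exceptions win; otherwise apply the rule: "th" as in "thick".
    return TH_EXCEPTIONS.get(key, "TH")

def guess_origin(name):
    """Guess a name's language of origin from its ending."""
    key = name.lower()
    if key.endswith("eaux"):
        return "French"
    if key.endswith(("fu", "ku", "wa")):
        return "Japanese"
    return None  # no ending-based clue

for word in ["Thackeray", "Thomas", "then"]:
    print(word, initial_th_sound(word))
for name in ["Boudreaux", "Ozawa", "Smith"]:
    print(name, guess_origin(name))
```

Checking the exceptions first keeps the rule set small: the rule covers the common case, and only the irregular words need individual entries.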

Macchi and Kahn wrote the original software for a reverse phone

directory

application when they worked at Bell Labs. Verizon today offers the

Call 54 service they developed: When a user dials 555-5454 and punches

in a phone number, the software responds by "saying" aloud

the name and address associated with that number. "The software

does a good job of pronouncing," says Macchi, "but it sounds

like a robot."

It turns out that in a reverse phone directory, customers are willing

to accept robot-like speech. But speech that is synthesized using

the 1,500 phonics rules can be difficult to understand and hard to

listen to for a long period. In response to customer desires for

human-like

voices in many applications, E-Speech and others in the industry are

developing methods to improve voice quality.

E-Speech’s approach sounds deceptively simple. Someone records huge

amounts of speech, for example, by reading a book. From this store

of recorded speech sounds, E-Speech’s software will select phonemic

strings roughly comparable to the speech it needs to synthesize and

concatenate them into sentences. If available, it will use an entire

sentence or phrase, like "How are you?" If only a single word

matches, it uses that. The next level is the syllable, and if nothing

else is available, it uses the 1,500 rules. Although this may sound

relatively straightforward, Macchi warns that "you have to be

smart about knowing what pieces to glue together."
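The fallback order described above (whole phrase, then word, then syllable, then the phonics rules) can be sketched in Python as follows. This shows only the choice of level, not the audio selection or the "gluing" Macchi warns about, and it is not E-Speech's code: the inventory sets, the toy syllable table, and all names are hypothetical.

```python
import string

def normalize(text):
    """Lowercase the text, drop punctuation, and split it into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def plan_units(text, phrases, words, syllables, syllabify):
    """Return (level, unit) pairs, always preferring the largest recorded unit."""
    tokens = normalize(text)
    phrase = " ".join(tokens)
    if phrase in phrases:
        return [("phrase", phrase)]
    plan = []
    for word in tokens:
        if word in words:
            plan.append(("word", word))
            continue
        for syl in syllabify(word):
            # Last resort: synthesize the piece from the phonics rules.
            level = "syllable" if syl in syllables else "rules"
            plan.append((level, syl))
    return plan

# Hypothetical recorded inventory and a toy syllable lookup.
PHRASES = {"how are you"}
WORDS = {"hello", "jose"}
SYLLABLES = {"gar", "ci"}
SPLITS = {"garcia": ["gar", "ci", "a"]}

def toy_syllabify(word):
    """Split a word using the toy table; unknown words stay whole."""
    return SPLITS.get(word, [word])

print(plan_units("How are you?", PHRASES, WORDS, SYLLABLES, toy_syllabify))
print(plan_units("Hello Jose Garcia", PHRASES, WORDS, SYLLABLES, toy_syllabify))
```

In this toy inventory "How are you?" is served from a single recording, while "Garcia" falls back to two recorded syllables and one rule-synthesized piece.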

This methodology, called natural-blended synthesis (because it blends

natural and synthesized speech), requires access to significant

computer storage and could not, therefore, reside in a computer chip

or a cell phone. But the information could be stored at a central

site and then transmitted by phone or radio to users.

For applications with limited vocabularies and intonation, the quality

of speech can be nearly perfect. "If you only want to say certain

things," says Macchi, "for example, Hispanic names, we can

make a computer program that sounds fabulous." The process of

synthesizing Spanish names is fairly simple. First, someone records

the most common names, like Jose and Garcia, because, says Macchi,

"we want the most common names to sound perfect." Once the

common names are recorded, E-Speech uses proprietary algorithms to

construct less common names from pieces of the ones that were

recorded.

The same synthesizer would not, however, work for full sentences,

because it would create sentences with the sing-song intonation of

names.

E-Speech created the Hispanic name synthesizer for a customer who

wanted voice-dialing for Hispanic people — when callers say who

they wish to speak to, the software confirms by asking something like

"Did you say you are calling Jose Garcia?" and then places

the call. Another customer is now interested in developing similar

software for the English-speaking population.

Another E-Speech customer is a map company that sells databases of

Global Positioning System coordinates and street names to developers

of car navigation systems. E-Speech’s software provides them with

accurate pronunciations of all streets and towns in the United States.

E-Speech has also supplied technology to a semiconductor manufacturer

that wants to put a text-to-speech capability into chips for

applications

such as cell phones, toys, appliances, and talking dictionaries. The

chip manufacturer is particularly interested in providing the

capability

to read someone’s E-mail to them over the phone. So far, E-Speech

has developed versions of this software for English and Mandarin

Chinese,

but, adds Macchi, "we will probably do it for other languages,

too, if it sells."

Macchi says she has always been interested in language and computers.

The daughter of a chemical engineer, she majored in math and French
at Trinity College in the District of Columbia (Class of 1969) and
went to work immediately for Bell Labs, where she was involved in
speech recognition and synthesis.

Bell Labs sent her for a Ph.D. in linguistics at New York

University.

During its last three years, E-Speech has had two full-time software

developers, Macchi and Kahn, and one part-time sales and marketing

person. When necessary, they hire extra personnel on a project basis.

Although most of their business is through referrals, they also go

to trade shows, put ads in trade magazines, and contact businesses

that might profit from their software.

"We were really growing like crazy until last year," says

Macchi. "I thought we’d have to hire more people, until the

middle of last year when the economy slowed down." But Macchi

has no worries about the future. She has lots of ideas for new

products

— like a car navigation system that also gives traffic reports,

reads E-mail, and provides stock quotes. "The field is in its

infancy," she says.

— Michele Alperin

E-Speech, 448 Cherry Hill Road, Princeton 08540.

Marian Macchi. 609-683-4340; fax, 609-683-4360. Home page:

www.espeech.com



This page is published by PrincetonInfo.com

— the web site for U.S. 1 Newspaper in Princeton, New Jersey.
