Corrections or additions?
These articles by Barbara Fox and Michele Alperin were prepared
for the March 27, 2002 edition of U.S. 1 Newspaper. All rights reserved.
All Talk, Part I
It seemed like black magic in the late 1980s when
demonstrators showed the Dragon speech-to-text program: You talked
into the microphone and the computer wrote down your words. Now the
most unlikely objects (cars and television sets, for example) are
learning to talk and listen.
Phone directories, package delivery services, and manufacturers of
products for the disabled are important users of speech software.
Stockbrokers save millions annually by using speech recognition
software for automated stock quotes by telephone, and cell phones have
E-mail. Says Bill Meisel, of the California-based Speech Recognition
Update newsletter: "While venture financing has been tight in
general, speech companies are among the priorities for venture
capitalists returning to the trough."
Much of the basic research on speech software took place in New Jersey
at Bell Labs and the Institute for Defense Analyses. Yet though some
Princeton area companies are surviving in this space, none here have
struck gold, and several have completely vacated the speech arena.
Marvin Preston, a turnaround consultant with NewMarkets Inc. on
Street, thinks that
entrepreneurial struggles are endemic to the speech technology
A University of Michigan alumnus (Class of 1966), Preston had worked
for IBM, Ford Motor Company, and Exxon, and from 1987 to 1994 he was
the turn-around CEO of Texas-based Scott Instruments, which he
"Speech will never be as accurate as people expect it to be,
because people hear different things, depending on what they are thinking
about," says Preston. "So many claims have been made about the
prospects for speech recognition companies that it has come to be
a bad joke, but it has a future. When people stop trying to attain
an unachievable perfection, and concentrate on making use of what
speech recognition really can do, they can make money with it."
In recent years seven Princeton companies have made
major investments in the three areas of speech software technology:
verification, recognition, and text-to-speech
identification. (See story below.)
and text-to-speech software for consumer devices. (See story below.)
warehouses. (See next story.)
talk to television sets and toaster ovens.
speech technology for next-generation text-to-speech applications.
and returned to its core business.
Speech recognition technology is particularly arcane because it offers
no achievable perfection, says Preston. "It tries to mimic human
hearing, which is imperfect. The ability to deal with the variability
makes it demanding for the scientist, yet you are responding to a
marketplace that thinks it needs perfection."
"It is classic," says Preston, "for people who understand
speech recognition technology to not have the entrepreneurial skills
to work with other people collaboratively to build a business. By
the time they have paid the tuition, they have lost the company. You
cannot find a speech recognition company to which this does not apply."
— Barbara Fox
Sherlock Holmes could examine a footprint in the dust,
make deductions, and, bingo, figure out the identity of a murderer.
Today’s crime detection agents use more sophisticated
methods or behavioral science to track down criminals. But the need to
verify identity has moved far beyond the crime scene to everyday life in
corporations and public venues of all kinds, primarily for controlling
access to both physical and electronic space.
Over the last decade, entrepreneurs have begun marketing biometrics
that verify individual identity, ranging from fingerprints to pupil
recognition. Voiceprints, which are statistical records of the unique
characteristics of the human voice, are biometrics that can be compared
electronically to determine whether people are, literally, who they
say they are.
Five years ago, a group of former Applied Data Research employees
saw market potential in the ability to control electronic access via
voiceprints instead of passwords and plastic cards. They located
military-grade voiceprint verification technology in the United
Kingdom and at the same time interested venture capitalist John Torkelsen
in their idea. The result was VeriVoice, a seven-year-old corporation
founded by Joe Mannino, who is now CEO, and Barry Frankel, who has
since left the company. It moved from 5 Vaughn Drive, where it had
lodged with Torkelsen’s firm, Acorn Technology, to 501 Forrestal
Drive, where it has eight employees.
"Security is like the weather. Everyone was talking about
it, but no one was doing anything about it, until now. All the companies
have been around for years, but now the public is willing to accept
an added layer of security," says Mannino. Credit card companies
suffer billions of dollars in fraud annually, but rather than
their customers to catch the criminals, they had been passing along
the costs. "With voice verification technology, they could cut
fraud by orders of magnitude. It requires no sophistication to steal
a PIN number or forge a signature, but it would take elaborate
to even try to imitate a biometric."
It turns out that voiceprints are unique to individuals, in the same
way that fingerprints are. Technically, voiceprints are statistical
representations of a combination of selected human voice characteristics.
Mannino says: "Although each competitor looks at a different
set of features, the kinds of things we all look at are pitch, mean
frequency, and the harmonics in speech." With respect to any single
feature, a person’s voiceprint is not unique, but the combination
of features is slightly different for each person and, therefore, unique.
The uniqueness of each voiceprint derives from the physical characteristics
that create it: the length and thickness of the voice box and the
vocal cords, the way the cords vibrate, and the effect of the physical
environment surrounding the voice box on the echoes created within
it. Gregor Havkin, vice president of research and development for
VeriVoice, likens the human voice apparatus to a handmade violin.
"The more violins differ in subtle physical characteristics,
the more they will sound different," says Havkin.
Not even the most gifted mimic can reproduce the subtle nuances
in a voiceprint. "A person’s voice box is his voice box,"
observes Havkin. "It is like a bed. If we peel the sheets off,
it is still the same mattress and bedspring." So stable is this
vocal core that even when a person suffers from a cold, the voiceprint
is usually not compromised. "When we build the template,"
says Havkin, "the computer is sampling
the innate properties that the human ear takes in but cannot separate,
but the computer can."
When individuals record their voices on VeriVoice’s software, they
are asked to repeat several random sets of digits into a computer
microphone or a telephone. The software uses these utterances to create
a voiceprint, which it keeps on file. Later, when the same individuals
seek access to a computer or to a building, they must "claim"
who they are by submitting an ID, perhaps a user name or a Social
Security number. This ID brings the "claimed"
voiceprint into current memory.
VeriVoice’s software compares this claimed voiceprint with a
print created in real time based on one, or at most two, random sets
of digits that the user is asked to repeat. The advantages of this
approach are twofold: the user does not need to remember a
pass phrase, and there is no way for a potential intruder to collect
a phrase from the genuine user and replay it with a tape recorder
or other device.
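The random-digit prompt described above can be sketched in a few lines of Python. This is only an illustration of the challenge idea, not VeriVoice's software: the function names, the number of digit sets, and the digits per set are all assumptions, and the voiceprint comparison itself is not modeled.

```python
import secrets

DIGITS = "0123456789"

def make_challenge(num_sets=2, digits_per_set=4):
    """Return fresh random digit strings for the speaker to repeat.

    The set count and length are illustrative assumptions.
    """
    return [
        "".join(secrets.choice(DIGITS) for _ in range(digits_per_set))
        for _ in range(num_sets)
    ]

def response_matches(challenge, spoken_sets):
    """True only if the speaker repeated exactly the digits just issued."""
    return list(spoken_sets) == list(challenge)

# A new challenge is drawn for every access attempt.
challenge = make_challenge()
print("Please repeat:", " / ".join(challenge))
```

Because the digits change on every attempt, a tape of an earlier session would almost certainly contain the wrong digits, which is the replay protection the article describes.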
The actual verification procedure involves a statistical comparison
between the claimed voiceprint and the verification voiceprint. For
most situations, a 99.5 percent probability is good enough, and
setting the bar higher tends to screen out some legitimate users.
There are cases, though, where a stricter threshold is warranted, for
example where the software is used to give brokers access to clients’
accounts.
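The trade-off between a stricter bar and locked-out legitimate users can be shown with a small Python sketch. The 0.995 default comes from the figure quoted above; the match scores and the function names are made up for illustration and do not come from VeriVoice.

```python
def accept(match_probability, threshold=0.995):
    """Grant access when the match probability reaches the threshold."""
    return match_probability >= threshold

def false_reject_rate(genuine_scores, threshold):
    """Fraction of legitimate users who would be turned away."""
    rejected = sum(1 for score in genuine_scores if score < threshold)
    return rejected / len(genuine_scores)

# Hypothetical match probabilities for four legitimate users.
genuine_scores = [0.999, 0.997, 0.9995, 0.996]

for threshold in (0.995, 0.998):
    rate = false_reject_rate(genuine_scores, threshold)
    print(f"threshold {threshold}: {rate:.0%} of legitimate users rejected")
```

With these invented scores, nobody is rejected at 0.995, while raising the bar to 0.998 turns away half of the legitimate users.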
Voiceprints offer many advantages over passwords, which can be lost,
stolen, or corrupted. "Your voice is a biological part of you, and
you can’t give it to someone else," says Havkin. Nor can someone
steal a voiceprint from you.
Other biometrics used in security applications include fingerprinting,
iris or retina scans, and handwriting analysis. Voiceprints can be
less expensive than some of the alternatives because they require
only a telephone or a computer with a microphone and a sound card.
Havkin says voiceprints provide the same level of verification
as fingerprints read by a fingerprint reader (as distinguished from
fingerprints analyzed by human experts in criminal investigations,
which verify more accurately). Another point in favor of voice verification,
he says, is that, unlike fingerprinting, it does not carry the stigma
of being associated with criminal investigations.
The primary limitation Havkin sees on the use of voiceprints for verification
is the presence of loud voices or other loud noises that might distort
the voiceprint. As a result, voiceprints would not be a good choice
for verification in a crowded marketplace, in a convention hall, or
on an airport tarmac. The user interface is also important: To get
the most accurate identification, the user must keep the microphone
at a reasonably constant distance from the mouth and speak neither
too loudly nor too softly.
CEO Mannino came across biometric verification technology when
he served on the board of the Rutgers Center of Aids to Industrial Productivity.
The son of Italian immigrants, he majored in mechanical engineering at
the University of Pennsylvania, and has worked for Applied Data
Research, Intel, and Oracle — always in leading product
development. He kept his job at Oracle until three years ago. "I’m
the guy with the arrows in his back — it keeps me intellectually
sharp," says Mannino.
Havkin graduated from Tel Aviv University in 1974 with a degree in
biology. He holds a Ph.D. in animal behavior from Dalhousie University
in Halifax and a degree in veterinary medicine from the University
of Pennsylvania. Havkin says that, before joining VeriVoice, he was
Boris Fridman’s vice president of research and development
at Nettech Systems, now called Broadbeam, and designed its initial
product offering. He joined VeriVoice in 1997.
The earliest application of voiceprint technology was in the corrections
industry, where it is used to regulate movement of prisoners. Whereas prisoners can
easily steal plastic cards or procure them by force, voiceprints are
inseparable from the individual.
Some companies offer
verification only as application service providers. Others offer only
software that is server based. With VeriVoice’s central product, the
Developers Toolkit, the
user retains complete freedom to develop applications. "We are
catering to companies that want to own the technology and
capability," says Jordan Byk, vice president of marketing
(Carnegie Mellon, Class of 1983, Rutgers MBA). VeriVoice’s
technology is flexible enough to fit entirely on a PDA or cell phone
or to be implemented on a server, laptop, or PC.
VeriLock, a generic application that provides security for PCs, can be
embedded into users’ applications, letting them know immediately when
someone is violating a security policy.
Thus far VeriVoice’s primary focus has been applications for securing
employee computers, accessing financial services environments,
time and attendance, and accessing medical records in health care
institutions and insurance companies. The last is a promising market
because the Health Insurance Portability and Accountability Act of
1996, which has its final implementation scheduled for 2003, requires
health care and insurance institutions to guarantee access to medical
records through a biometric application.
Byk says that VeriVoice expects to continue working with companies
from all sectors on system access applications. VeriVoice is also
working through vendors to develop password reset applications for
technology help desks, where 30 to 40 percent of phone calls come
from people who have forgotten their passwords. The company is also
providing verification for speech recognition companies like
SpeechWorks in Boston. Other applications well suited to voice verification include
queuing situations, where lots of people are trying to enter the same
location at the same time, and enterprise software, where employees
need to log into multiple projects for purposes of computing time
and materials costs.
Byk puts the number of competitors to VeriVoice at 4 1/2. The
"1/2" represents one competitor that is now restructuring its
For Nuance, a large company in California, voice verification (which
verifies a speaker’s identity) is secondary to its main business in
voice recognition (which seeks to understand what is being said).
Because Nuance’s verification software is bundled with its recognizer,
a user must purchase both. The result is a higher price. Another limitation
Byk sees on Nuance’s verification business is that it is not pursuing
the Web, but is using its product only with telephony.
Another VeriVoice competitor, Buytel, a company based in Ireland,
is entirely an application service provider (ASP); it does not sell
a product, but rather a verification service. Users must collect
enrollment and verification samples, ship them to Buytel’s server, and
wait for a verification.
Though Byk estimates the voiceprint market to be $100 to $300 million
over the next couple of years, people still consider the computer
predominantly a keyboard and mouse interface. "Even though they
are used to talking to and yelling at their computers," says Havkin,
"they haven’t made the move to using voice as the primary
interface."
Havkin expects that what is natural with telephones,
interacting verbally, will soon become the case with computers.
Just in the past year, he says, people have begun to use dictation
and voice command packages. And although most "desk side"
computers do not have built-in microphones, laptops generally do,
suggesting a "slow take" scenario in which people are moving
towards voice, but haven’t quite gotten there. Once people are ready
for voice, however, he believes that applications are just waiting
to happen. For example, he foresees replacing the elaborate
necessary to retrieve voicemail messages in some corporations with
voice verification software in each user’s phone.
VeriVoice is marketing its products through direct sales and through
value-added reseller agreements. Over the next couple of months,
VeriVoice expects to be staffing up its engineering and sales
organizations with several new hires.
In this field, one of VeriVoice’s customers (SpeechWorks) and one
of its competitors (Nuance) have reaped millions by going public.
Byk expects that VeriVoice will eventually be a good bet for a buyout,
either by a larger company that wants to consolidate multiple
biometrics into a single company or by a speech recognition company that wants
to add voice verification to its product line. When asked where he
envisions VeriVoice five years from now, he responded, "Sold with
lots of payouts to all the employees."
— Michele Alperin
Princeton 08540. Joe Mannino, CEO. 609-452-9220; fax, 609-452-9228.
Home page: www.verivoice.com
With the increasing use of automobile navigation systems,
cell phones, and automated phone answering systems, Americans are
spending more time talking to computers and listening to their responses.
Whether the content of these "conversations" is traffic reports,
stock quotes, or online reservations, successful communication
often depends upon accurate name pronunciation for people, places,
and businesses. This happens to be the specialty of five-year-old
E-Speech, a Princeton-based firm that produces name pronunciation
software for speech recognition and text-to-speech applications.
Marian Macchi and Dan Kahn, former Bell Labs researchers, founded
E-Speech in 1997. Having made a royalty deal with Telcordia (the
spinout), they took the core software with them and have since done
"I’d say the Bell technology they walked away with is very robust
and very sound," says Marvin Preston, the former CEO of a speech
technology firm. "It is terribly resource demanding. But if you
are embedding it in a phone system, it makes a lot of sense because
you are running a lot of calls through it."
Because Macchi and Kahn left Telcordia with a product and
customers, they have been able to proceed without outside funding,
yet they have not had significant growth. Macchi is still working
from her home on Cherry Hill Road.
Macchi says that one venue where name pronunciation is critical is
in call centers, where it can mean the difference between potential
customers or voters listening to the agents or hanging up in their
faces. E-Speech can provide these call agents with either a phonetic
transcription of each name in their database or the ability to click
on a name and listen to its pronunciation. "Call centers are a
market we’re trying to develop — in particular, helping agents
pronounce people’s names," says Macchi, noting that the company
has already completed one such contract for a political party.
E-Speech also supports most speech recognition systems. In most of these,
the computer listens to an audio signal; figures out what phonemes
(phonetic speech sounds) it comprises; and then looks into a
database to find out what words have been spoken. But these databases
do not usually include entries for the names of people, places, and
businesses, and the quantity and demographic variability of names
militate against the creation of a static dictionary.
"In the United States, there are more than 2 million unique
surnames," Macchi explains. The repository of company names is
also changing, she says, reinforcing her point by reading from a list
of new names from current IPO filings: Kyphomn and Anteon and Altiris.
"A person can’t just sit and write transcriptions of every
name," says Macchi. "That would mean an impossibly large
amount of hand labor, and hand transcription of weird phonetic symbols
would mean lots of mistakes."
"We could never keep up with this stuff, so
we’ve figured out phonics rules to pronounce words and names."
E-Speech’s software includes about 1,500 phonics rules essential to
the correct pronunciation of names. One rule, for example, is that
"when a `th’ comes at the beginning of a word, it is pronounced
`th’ as in `thick.’" Any words in which a "th" at the
beginning of a word is not pronounced this way (for example,
the name "Thomas" and words like "the,"
and "therefore") would go into a dictionary of exceptions.
(E-Speech’s exceptions dictionary includes several thousand words
and names.) The software also takes into account the ethnic origin
of a name to ensure correct pronunciation; for example, it might recognize
a French name from an "eaux" ending or a Japanese one from
a "fu," "ku," or "wa" ending.
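The interplay of rules, exceptions, and name origin described above can be sketched in Python. This is a toy illustration, not E-Speech's software: the function names, the phonetic symbols, and the two sample exception entries are assumptions, and real systems use far more rules than the single "th" rule shown.

```python
# Hypothetical exceptions dictionary; the symbols are illustrative only.
EXCEPTIONS = {
    "thomas": ["T", "AA", "M", "AH", "S"],
    "the": ["DH", "AH"],
}

def initial_sound(word):
    """Return the first sound of a word, or None if no rule applies.

    The exceptions dictionary is consulted before the general rule.
    """
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key][0]
    if key.startswith("th"):
        return "TH"  # the rule: initial "th" sounds as in "thick"
    return None

def guess_origin(name):
    """Guess a name's origin from its ending, per the article's examples."""
    key = name.lower()
    if key.endswith("eaux"):
        return "French"
    if key.endswith(("fu", "ku", "wa")):
        return "Japanese"
    return None

print(initial_sound("Thick"), initial_sound("Thomas"))
print(guess_origin("Boudreaux"), guess_origin("Kurosawa"))
```

Checking exceptions first keeps the general rule simple. Ending-based origin guesses are crude (a name like "Iowa" would be misjudged), which is presumably why a production system needs many more rules and a large exceptions list.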
Macchi and Kahn wrote the original software for a reverse phone
application when they worked at Bell Labs. Verizon today offers the
Call 54 service they developed: When a user dials 555-5454 and punches
in a phone number, the software responds by "saying" aloud
the name and address associated with that number. "The software
does a good job of pronouncing," says Macchi, "but it sounds
like a robot."
It turns out that in a reverse phone directory, customers are willing
to accept robot-like speech. But speech that is synthesized using
the 1,500 phonics rules can be difficult to understand and hard to
listen to for a long period. In response to customer desires for more natural
voices in many applications, E-Speech and others in the industry are
developing methods to improve voice quality.
E-Speech’s approach sounds deceptively simple. Someone records huge
amounts of speech, for example, by reading a book. From this store
of recorded speech sounds, E-Speech’s software will select phonemic
strings roughly comparable to the speech it needs to synthesize and
concatenate them into sentences. If available, it will use an entire
sentence or phrase, like "How are you?" If only a single word
matches, it uses that. The next level is the syllable, and if nothing
else is available, it uses the 1,500 rules. Although this may sound
relatively straightforward, Macchi warns that "you have to be
smart about knowing what pieces to glue together."
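The fallback order described above (whole phrase, then word, then syllable, then rules) can be sketched in Python. This is a simplified illustration under stated assumptions, not E-Speech's algorithm: the function names and sample inventories are invented, the syllable level is omitted for brevity, and no audio is actually joined.

```python
def normalize(text):
    """Lowercase the text and drop trailing punctuation."""
    return text.lower().strip().rstrip("?.!,")

def plan_synthesis(text, recorded_phrases, recorded_words):
    """Pick the largest recorded unit available for each part of the text.

    Returns (source, unit) pairs, where source is "phrase", "word",
    or "rules" (rule-based synthesis as the last resort).
    """
    key = normalize(text)
    if key in recorded_phrases:
        return [("phrase", key)]
    plan = []
    for word in key.split():
        source = "word" if word in recorded_words else "rules"
        plan.append((source, word))
    return plan

# Hypothetical inventory of recorded speech.
phrases = {"how are you"}
words = {"how", "is", "you"}

print(plan_synthesis("How are you?", phrases, words))
print(plan_synthesis("How is Macchi?", phrases, words))
```

The second call falls back to rules only for the unrecorded name, which mirrors the idea of blending natural recordings with synthesized speech.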
This methodology, called natural-blended synthesis (because it blends
natural and synthesized speech), requires access to significant
computer storage and could not, therefore, reside in a computer chip
or a cell phone. But the information could be stored at a central
site and then transmitted by phone or radio to users.
For applications with limited vocabularies and intonation, the quality
of speech can be nearly perfect. "If you only want to say certain
things," says Macchi, "for example, Hispanic names, we can
make a computer program that sounds fabulous." The process of
synthesizing Spanish names is fairly simple. First, someone records
the most common names, like Jose and Garcia, because, says Macchi,
"we want the most common names to sound perfect." Once the
common names are recorded, E-Speech uses proprietary algorithms to
construct less common names from pieces of the ones that were recorded.
The same synthesizer would not, however, work for full sentences,
because it would create sentences with the sing-song intonation of
E-Speech created the Hispanic name synthesizer for a customer who
wanted voice-dialing for Hispanic people — when callers say who
they wish to speak to, the software confirms by asking something like
"Did you say you are calling Jose Garcia?" and then places
the call. Another customer is now interested in developing similar
software for the English-speaking population.
Another E-Speech customer is a map company that sells databases of
Global Positioning System coordinates and street names to developers
of car navigation systems. E-Speech’s software provides them with
accurate pronunciations of all streets and towns in the United States.
E-Speech has also supplied technology to a semiconductor manufacturer
that wants to put a text-to-speech capability into chips for devices
such as cell phones, toys, appliances, and talking dictionaries. The
chip manufacturer is particularly interested in providing the ability
to read someone’s E-mail to them over the phone. So far, E-Speech
has developed versions of this software for English and Mandarin,
but, adds Macchi, "we will probably do it for other languages,
too, if it sells."
Macchi says she has always been interested in language and computers.
The daughter of a chemical engineer, she majored in math and French
at Trinity College in the District of Columbia (Class of 1969) and
went to work for Bell Labs, where she was involved in speech recognition and
Bell Labs sent her for a Ph.D. in linguistics at New York
During its last three years, E-Speech has had two full-time software
developers, Macchi and Kahn, and one part-time sales and marketing
person. When necessary, they hire extra personnel on a project basis.
Although most of their business is through referrals, they also go
to trade shows, put ads in trade magazines, and contact businesses
that might profit from their software.
"We were really growing like crazy until last year," says
Macchi, "I thought we’d have to hire more people — until the
middle of last year when the economy slowed down." But Macchi
has no worries about the future. She has lots of ideas for new
— like a car navigation system that also gives traffic reports,
reads E-mail, and provides stock quotes. "The field is in its
infancy," she says.
— Michele Alperin
Marian Macchi. 609-683-4340; fax, 609-683-4360. Home page:
This page is published by PrincetonInfo.com
— the web site for U.S. 1 Newspaper in Princeton, New Jersey.