Corrections or additions?
These stories by Barbara Fox and Peter J. Mladineo were published
in U.S. 1 Newspaper on May 27, 1998. All rights reserved.
Search Engines and E-Commerce
Because it makes every scientist’s research instantly
accessible to every other scientist, the World Wide Web has speeded
up the progress of discovery in an amazing way. In times past, one
scientist would read an abstract in a journal, then mail a request
for a copy of the full paper, a process that could take weeks or even months.
Now a scientist can use a search engine to do a word search on a topic
on the World Wide Web and get an immediate response.
It sounds ideal. But two NEC researchers at 4 Independence Way, C.
Lee Giles and Steve Lawrence, have proved that recovery of scientific
information from the Web is not ideal. Their paper "Searching
the World Wide Web," published in the April 3 issue of Science
Magazine, documents in dismaying detail that no single search engine
indexes more than about one-third of the Web pages available.
"Before our study, there was no convincing evidence that the
engines didn’t cover most of the world," says Lawrence. "It
is a rigorous statistical study."
The media — the New York Times, Wall Street Journal, Associated
Press, National Public Radio, MSNBC — latched onto this story
with enthusiasm, and Giles and Lawrence found themselves giving phone
interviews to reporters all over the world. These two experts in
artificial intelligence and neural networks had, until now, worked
in such high-falutin’ areas as facial recognition patterns, currency
exchange predictions, and natural language processing. Now their research
could have an impact on how a fourth-grader did her homework.
Most reporters focused on the "Emperor Has No Clothes" aspect
of the report, the conclusion that search engines are missing most
of the available pages. Lawrence and Giles had searched the Web for
a two-day period in December: They found that HotBot indexed the most
pages (34 percent), followed by AltaVista (28 percent), Northern Light
(20 percent), Excite (14 percent), Infoseek (10 percent), and Lycos
(only 3 percent). A search with all six engines, nevertheless, covered
only 60 percent of the indexable Web.
But "numbers of pages retrieved" does not tell the whole story,
and here’s why: A search engine goes out to the World Wide Web, indexes
words on the pages it finds, then stores those indexed pages in its
own archive. When you summon a search engine, it doesn’t roam the
Web to get your answer; it retrieves pages from its archive. So if
an archive isn’t cleaned out regularly, it will bring up links to
useless pages that may not even exist.
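
The crawl-then-archive behavior described above can be sketched in a few lines of Python. This is a toy model, not how any of the engines named here is actually built: the class name, the example URLs, and the word-splitting rule are all assumptions made for illustration. It shows only why an archive that is not refreshed keeps returning links to pages that have since disappeared.

```python
from collections import defaultdict


class ToySearchEngine:
    """Toy model: crawl once, then answer queries from the stored archive."""

    def __init__(self):
        # word -> set of URLs archived under that word
        self.index = defaultdict(set)

    def crawl(self, live_web):
        """Index the words on every page that exists at crawl time."""
        for url, text in live_web.items():
            for word in text.lower().split():
                self.index[word].add(url)

    def search(self, word):
        """Answer from the archive only; the live Web is not consulted."""
        return sorted(self.index.get(word.lower(), set()))


def dead_links(results, live_web):
    """Return archived results that no longer exist on the live Web."""
    return [url for url in results if url not in live_web]


# Hypothetical two-page Web used only for this illustration.
web = {
    "http://example.org/a": "neural networks research",
    "http://example.org/b": "neural language processing",
}
engine = ToySearchEngine()
engine.crawl(web)

del web["http://example.org/b"]  # the page disappears after the crawl

hits = engine.search("neural")
print(hits)                   # both archived URLs still come back
print(dead_links(hits, web))  # the removed page shows up as a dead link
```

Re-running `crawl` on a fresh index would clear the dead link, which is the freshness side of the tradeoff discussed next.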
Giles and Lawrence found that if accuracy or the immediacy of the
updating is important, low-retrieving Lycos is the best engine —
it had the freshest material, compared to high-retrieving HotBot,
which coughed up the most dead links. Comprehensiveness, they found,
is a tradeoff for freshness.
A less negative way to look at their research is that they found the
Web to be more voluminous than anyone previously estimated, thus the
percentage of the total pages found is smaller than estimated. Giles
and Lawrence claim that the indexable Web (the Web not guarded by
passwords and firewalls) now has at least 320 million pages, compared
with previous guesses of 175 to 250 million.
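
The arithmetic behind that "less negative" reading can be checked directly. The sketch below assumes the percentages quoted earlier are shares of the 320 million page estimate; the function names are hypothetical. It converts each share into an approximate page count and shows what share the same count would have represented under the older guesses of 175 to 250 million pages.

```python
# Figures quoted in the article, read here as shares of the indexable Web.
COVERAGE = {
    "HotBot": 0.34, "AltaVista": 0.28, "Northern Light": 0.20,
    "Excite": 0.14, "Infoseek": 0.10, "Lycos": 0.03,
}
NEW_ESTIMATE = 320e6           # pages, per Giles and Lawrence
OLD_GUESSES = (175e6, 250e6)   # pages, earlier guesses cited in the article


def pages_indexed(share, web_size=NEW_ESTIMATE):
    """Approximate number of pages an engine indexes."""
    return share * web_size


def apparent_coverage(pages, assumed_web_size):
    """Share of the Web those pages would represent under another size."""
    return pages / assumed_web_size


for engine, share in COVERAGE.items():
    pages = pages_indexed(share)
    # A larger assumed Web gives the lower apparent share, so reverse the order.
    low, high = (apparent_coverage(pages, size) for size in reversed(OLD_GUESSES))
    print(f"{engine:15s} ~{pages / 1e6:5.1f}M pages; "
          f"would look like {low:.0%} to {high:.0%} under the older guesses")
```

For HotBot this works out to roughly 109 million pages, which would have looked like 44 to 62 percent of a Web assumed to hold 175 to 250 million pages.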
Soon their phones were ringing off the hook: their home pages
list more than a full page of links to stories published around the
world. Most reporters were accurate, they say, though they laugh about
how frequently Lawrence’s quotes were attributed to Giles and vice
versa. That’s surprising, even for a dual interview, because Steve
Lawrence’s speech is distinguished by a strong Australian accent.
Nevertheless Giles and Lawrence are alike in many ways:
They share a fervent commitment to basic research, finding out how
things work without needing to immediately apply the knowledge to
an actual project. They are among more than 100 people at 4 Independence
Way doing basic research at NEC Research Institute. Another 45 people
at NEC USA C&C Research Laboratories are focussed on applied research.
And three dozen employees do software development for UNIX products
in NEC’s Open Systems Technology Center, also at 4 Independence Way.
"In some sense, it’s our job, to do research and come up with
results," says Lawrence. "The atmosphere here is very
to doing basic research."
"I wake up and say, Isn’t it great I am being paid to do
says Giles. "I think we have fun here."
Born in Memphis, Tennessee, C. Lee Giles stayed at home to attend
Rhodes College. He earned doctor’s degrees from the University of
Michigan and the University of Arizona. He worked for Ford Motor
Research in Dearborn, Michigan, taught electrical engineering and
computer engineering at Clarkson University, and then worked at the
Naval Research Laboratory in Washington, D.C. He is married and has a
daughter. His research areas include machine learning, optics, and
artificial intelligence neural networks.
Perhaps because Lawrence has considerable experience in theater and
improvisational comedy, he is the more talkative of the two. He grew
up in Queensland, Australia, where his father was an accountant, and
went to Queensland University of Technology, earning two undergraduate
degrees in 1993. After finishing his Ph.D. in Queensland, he came
to NEC Research Institute in 1996. "Lee was one of the major
that led me to pursue research as opposed to going into industry,"
he says.
Why did it take two artificial intelligence experts to do this study?
Why couldn’t two librarians or two statisticians do it just as well?
To do it efficiently Giles and Lawrence had to use NEC’s internal
search engine, Inquirus, which they had previously created. Over a
period of a year, on and off ("mostly off," says Giles) they
took queries that other NEC scientists had made, analyzed them, boiled
them down to 572 queries, wrote a program with restrictive parameters,
and fed it into Inquirus. Then from December 15 to 17 they just
Among their parameters: to count all documents
then count only the documents with the query term exact (not plural)
and remove duplicates. Only lowercase queries were used. They removed
pages with a "time-out" of more than 60 seconds. They limited
queries to those with 600 or fewer documents retrieved.
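
The filtering rules listed above can be sketched as a short Python routine. It is an illustration of the stated parameters only, not the program Giles and Lawrence fed into Inquirus: the `Doc` record, its `fetch_seconds` field, matching by whole word without regard to case, and treating a repeated URL as a duplicate are all assumptions.

```python
import re
from dataclasses import dataclass

MAX_DOCS_PER_QUERY = 600   # from the article
TIMEOUT_SECONDS = 60       # from the article


@dataclass
class Doc:
    url: str
    text: str
    fetch_seconds: float   # hypothetical field: how long the download took


def contains_exact_term(text, term):
    """True if the term appears as a whole word, so 'network' != 'networks'."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_results(query, docs):
    """Apply the article's filters; return None if the query is excluded."""
    if query != query.lower() or len(docs) > MAX_DOCS_PER_QUERY:
        return None
    kept, seen = [], set()
    for doc in docs:
        if doc.fetch_seconds > TIMEOUT_SECONDS:
            continue   # page timed out
        if not contains_exact_term(doc.text, query):
            continue   # exact query term is not on the page
        if doc.url in seen:
            continue   # duplicate (assumed here to mean a repeated URL)
        seen.add(doc.url)
        kept.append(doc)
    return kept


# Made-up results for the query "network".
docs = [
    Doc("u1", "a neural network", 1.0),
    Doc("u1", "a neural network", 1.0),   # duplicate
    Doc("u2", "neural networks", 1.0),    # plural only
    Doc("u3", "network", 90.0),           # timed out
]
print([d.url for d in filter_results("network", docs)])
```

Only the first document survives all three filters in this made-up example.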
They manually checked that all results were retrieved from each engine
and were parsed correctly. "The engines periodically change their
formats for listing documents and for requesting the next page of
documents," says Giles.
Having made such a splash with this study, they have everything in
place to repeat the study to calculate the Web’s growth, but they
are already at work on a new project. They can’t talk about it yet,
except to say that it concerns a way to more efficiently disseminate
scientific information on the Web. Says Giles: "A very exciting
moment will come soon, in terms of impact of our research."
As experts in Web searches, they offer this advice for extracting
information from what amounts to a 15-billion word encyclopedia.
for NEC Research provides the link to the NEC Research Institute
as the only page returned.
use the major search engines: AltaVista, Excite, HotBot, Infoseek,
Lycos, and Northern Light.
search engine such as MetaCrawler (http://www.metacrawler.com)
to get 3.5 times as many documents. Some engines do not delete invalid
documents and so their results are false; some documents have been
changed to delete the term but may still be relevant.
such as Google and LASER (http://laser.cs.cmu.edu) for improved ranking of
results. "They make greater use of the structure of Web pages
and the graph formed by the links between pages to determine page
relevancy," explains Lawrence. Google has an efficient ranking
algorithm called PageRank and uses the text found in links to a page
to describe the page. "The links often contain better descriptions
of the pages than the pages themselves," he explains.
: keywords within the title or
URL, documents from a certain date range or geographic area, documents
with specific phrases rather than single terms, etc. Excite uses
clustering and Infoseek uses morphology; both will return documents
with related words. AltaVista returns only capitalized results for
such as Wired Newsbot and
Excite Newstracker for news articles, OpenText for business sites,
and DejaNews (http://www.dejanews.com) for Internet discussion
groups. AHOY!, a specialized search for home pages, may be able to
find an unindexed home page by going first to a particular university
department and then locating the scientist’s page within that department.
Even more specific information can be found in the original paper
on Science magazine’s website (http://www.aas.org). But,
due to a twist of irony, most of us will not find that paper
linked to any document produced by any search engine on the World
Wide Web. The American Association for the Advancement of Science
holds the copyright to the paper and has a six-month embargo policy,
so it is now available only to those who pay $100 for a subscription.
Forget "instant access." It’s back to the old-fashioned method
of corresponding with the scientist who did the work. At least the
mails are quicker. If you send an E-mail to either
you’ll get a reply by return E-mail
— Barbara Fox
NEC Research Institute, 4 Independence Way,
Princeton 08540. C. William Gear, president. 609-520-1555; fax,
Home page: http://www.neci.nj.nec.com.
Two years after bailing out of his sinking
CD-ROM business, Thynx, Larry Shiller has cut his hair, literally,
and is trying to stick up for the little guys in the world of cyber
Shiller’s home-based company, SBX, has devised a Web-based system
that gives small brokers an affordable
way to facilitate buying and selling stocks online. Once a broker
is online, the broker’s customers can simply log in to SBX, with its
real time stock feeds and information on a few thousand securities,
and buy and sell stocks directly from the site — without having
to pay broker fees or spreads (the difference between the bid and ask prices).
"Internet trading is not just for discount brokerages,"
says Shiller. "The Internet is a way for brokers to improve their
communications with their customers. I believe the Internet will be
the primary means of communication between a brokerage and its
in the future, which means that small brokerages need to be on the
Because the cost of setting up the cyber-infrastructure is
running to hundreds of thousands, possibly millions, of dollars, small
broker/dealers are losing Internet-savvy customers to the large
houses like E-Trade or Schwab. "In terms of the service to private
Internet order entry, small brokers are at a significant disadvantage
because they can’t afford to build a secure website with all of the
firewalls that are required," says Shiller. "For the small
brokers who are seeing a flow of assets out of their accounts, SBX
is a way to offer online trading."
The SBXNet page informs investors that they can place Internet trades
with the broker of their choice — provided their broker is signed
up with SBX. "We have to get brokers to sign up — that’s the
business model," says Simon Blackwell of InfoFirst, the Research
Park-based firm that is managing SBX’s website and doing the marketing
"What we expect to gain from this is order flow," says
"which allows us then to be successful with our second product
line." This is a trade-matching system for over-the-counter bulletin
board stocks — typically stocks for companies that are too small
to be listed on any of the major boards like NASDAQ or the New York
Stock Exchange.
These stocks end up in the netherworld of bulletin board pink sheets,
where they typically have very low volumes and liquidity, and where
their performance is controlled by brokers known as "market makers."
"It’s very difficult to create a market for them because nobody
knows they’re there," says Blackwell. "There’s no reasonably
accessible information about the company or what price people are
willing to buy the stock for."
Shiller claims that SBX has the world’s most complete and accurate
up-to-date database of publicly reporting over-the-counter bulletin
board stocks. And, because the order book for the trades is open,
interested stock buyers can see the price histories of their
stocks. "They can see not only the inside market but the breadth
of the market," says Shiller.
The system also automatically matches the buyer and the seller.
"What we’re really doing is taking the role of the market maker and
it to the investor," says Shiller.
What do market makers think of this? "Any time there
are systems put in place that potentially create more liquidity, that’s
great for everybody," says Douglas A. MacWright of FIA Capital,
the market maker for 1st Constitution Bank, the Route 130-based firm
that recently upped the number of its shares on the over-the-counter
bulletin board. "As a market maker I like to see a lot of
in stocks that I follow."
Shiller has a partner, K. Richard B. "Nick" Niehoff, president
of SBX. Known for his work automating the Cincinnati Stock Exchange,
Niehoff reports that SBX will also be able to do what no exchange
does right now — provide instantaneous and accurate quotes for
over-the-counter bulletin board stocks. "You can’t go to a
machine in a broker/dealer that is going to pop back a quote,"
says Niehoff. "This is not a firm quote market in most cases.
In the less-liquid areas of the market you don’t have this
But first, SBX must cross an important hurdle: clearing its concept
with the Securities and Exchange Commission. Last September, the company
sent a "no action relief request" to the commission, asking
to be exempted from the same regulations that govern stock exchanges
and associations of brokers and dealers. Niehoff explains that although
SBX seems like an exchange, it is not an exchange, since it doesn’t
have a floor nor any of the other amenities that real in-the-flesh
exchanges have, like market makers or screaming mobs or paper
He hopes to hear back from the SEC next month but isn’t surprised
that it has taken this long. "This is cyberspace and it does have
to be reworked into the current rules and regulations that the SEC
has," says Niehoff. Or vice versa, perhaps.
From Cincinnati, Niehoff, 55, is a graduate of the Lawrenceville School
(Class of ’61) and of the University of Cincinnati. For 11 years he
was president of the Cincinnati Stock Exchange and founded its
securities trading system, one of the first electronic global stock
exchanges in the world. Niehoff also managed a U.S.
project to implement Poland’s first over-the-counter stock market.
While Niehoff handles the day-to-day operations, Shiller
is responsible for the company’s coffers and ideology. Shiller, 44,
grew up in Long Island, the son of a chemist and a registered nurse.
A prodigy with a penchant for music, math, and Wall Street, Shiller
toured the country and abroad as a concert violinist, then went to
MIT and got a BS in math (Class of 1975). For his first three years
out of college, Shiller worked as an engineer for Owens Illinois,
then as a research supervisor for Blue Cross, in Toledo. At night
he played with the Toledo Symphony.
In 1978 at age 24, he started his first business, an accounting
company and service bureau, based in Florida. In 1990 Shiller wrote
a book, "Towards Software Excellence" (Prentice Hall) that
details software analysis and design. He also attended the OPM
program at Harvard Business School in 1996.
He started his biggest venture to date, the Bureau of Electronic Publishing,
out of his garage in Verona in 1988. Selling entertainment CD-ROMs
with titles like "The Great Kat’s Digital Beethoven on
"Inside the White House," and the Weather Channel’s
Weather," the company started out strong and moved to Parsippany,
where it did a $5 million initial public offering in 1995. Then in
1996 it moved to 619 Alexander Road and changed its name to Thynx
shortly thereafter (U.S. 1, April 17, 1996).
But as the Internet emerged as the medium of the future, sales began
to dwindle and its stock price began to plummet. Shiller ended up
selling the company’s corporate shell to a group participating in
a joint venture in a lucrative Chinese polyester plant and got out
with a "decent valuation," he says (U.S. 1, January 8, 1997).
Shiller is married to Marcelle Soviero, a PR consultant, and the couple
is expecting their second child. He is still active musically, playing
for the Princeton Chamber Symphony and the Riverside Symphonia,
and is an occasional concertmaster for the Westminster Community
Through SBX, Shiller is satisfying two unfulfilled dreams. First,
he can at long last merge his entrepreneurial talents with the Internet,
the nemesis of his previous company. "From that experience it
was clear we needed to find a path to the Internet," he says.
Second, it allows him to toy around with his childhood hobby, Wall
Street, all the while taking a few hacks at some of its inequities.
"The first thing I learned about Wall Street is that it didn’t
work anything like the way I worked," he says. "Wall Street
appeared to be by brokers, for brokers. The investor often got
More of Shiller’s idealism is apparent even in the firm’s phone
number, which reflects 1790, the year the first stock exchange in the United
States (the Philadelphia Stock Exchange) opened. "After 200 odd years
there is technology that allows people to do what they never could
have dreamed of," says Shiller. "We’ve solved the problem
of low volume, low liquidity, and high spreads by bringing the free
market to Wall Street."
Now if only that SEC approval comes through.
— Peter J. Mladineo
Home page: www.sbxnet.com.
Krieg, president. 609-683-3800; fax, 609-683-3802. Home page:
This page is published by PrincetonInfo.com
— the web site for U.S. 1 Newspaper in Princeton, New Jersey.