Search Engines and E-Commerce

Revving Up the Search Engines: NEC Pair

Because it makes every scientist’s research instantly

accessible to every other scientist, the World Wide Web has speeded

up the progress of discovery in an amazing way. In times past, one

scientist would read an abstract in a journal, then mail a request

for a copy of the full paper, a process that could take weeks or even months.

Now a scientist can use a search engine to do a word search on a topic

on the World Wide Web and get an immediate response.

It sounds ideal. But two NEC researchers at 4 Independence Way, C.

Lee Giles and Steve Lawrence, have proved that recovery of scientific

information from the Web is not ideal. Their paper "Searching

the World Wide Web," published in the April 3 issue of Science

Magazine, documents in dismaying detail that no single search engine

indexes more than about one-third of the Web pages available.

"Before our study, there was no convincing evidence that the


search engines didn’t cover most of the world," says Lawrence. "It

is a rigorous statistical study."

The media — the New York Times, Wall Street Journal, Associated

Press, National Public Radio, MSNBC — latched onto this story

with enthusiasm, and Giles and Lawrence found themselves giving phone

interviews to reporters all over the world. These two experts in

artificial intelligence and neural networks had, until now, worked

in such high-falutin’ areas as facial recognition patterns, currency

exchange predictions, and natural language processing. Now their work

could have an impact on how a fourth-grader did her homework.

Most reporters focused on the "Emperor Has No Clothes" aspect

of the report, the conclusion that search engines are missing most

of the available pages. Lawrence and Giles had searched the Web for

a two-day period in December: They found that HotBot indexed the most

pages (34 percent), followed by AltaVista (28 percent), Northern Light

(20 percent), Excite (14 percent), Infoseek (10 percent), and Lycos

(only 3 percent). A search with all six engines, nevertheless, covered

only 60 percent of the indexable Web.

But "numbers of pages retrieved" does not tell the whole story,

and here’s why: A search engine goes out to the World Wide Web, indexes the

words on the pages it finds, then stores those indexed pages in its

own archive. When you summon a search engine, it doesn’t roam the

Web to get your answer; it retrieves pages from its archive. So if

an archive isn’t cleaned out regularly, it will bring up links to

useless pages that may not even exist.

Giles and Lawrence found that if accuracy or the immediacy of the

updating is important, low-retrieving Lycos is the best engine —

it had the freshest material, compared to high-retrieving HotBot,

which coughed up the most dead links. Comprehensiveness, they found,

is a tradeoff for freshness.

A less negative way to look at their research is that they found the

Web to be more voluminous than anyone previously estimated, thus the

percentage of the total pages found is smaller than estimated. Giles

and Lawrence claim that the indexable Web (the Web not guarded by

passwords and firewalls) now has at least 320 million pages, compared

with previous guesses of 175 to 250 million.
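
The link between an engine's coverage and the size of the Web comes down to simple arithmetic, sketched below. This is an illustration only and is not presented as the method the researchers used. The function names are hypothetical, and the overlap formula is the standard capture-recapture estimate, included as an assumption about how a total could be approximated from two overlapping indexes.

```python
def implied_index_size(web_size: float, coverage: float) -> float:
    """Pages an engine would hold if it covers `coverage` of `web_size` pages."""
    return web_size * coverage


def overlap_estimate(n_a: int, n_b: int, n_both: int) -> float:
    """Capture-recapture (Lincoln-Petersen) estimate of a total population.

    n_a and n_b are the sizes of two independent samples (for example, two
    engines' indexes) and n_both is the number of items found in both.
    """
    if n_both <= 0:
        raise ValueError("the two samples must overlap")
    return n_a * n_b / n_both


# Figures quoted in the article.
WEB_SIZE = 320_000_000
COVERAGE = {
    "HotBot": 0.34,
    "AltaVista": 0.28,
    "Northern Light": 0.20,
    "Excite": 0.14,
    "Infoseek": 0.10,
    "Lycos": 0.03,
}

for engine, share in COVERAGE.items():
    millions = implied_index_size(WEB_SIZE, share) / 1e6
    print(f"{engine}: roughly {millions:.0f} million pages indexed")
```

Read the other way, an engine's index size divided by its measured coverage gives an estimate of the whole indexable Web.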

Soon their phones were literally ringing off the hook: their home

pages

list more than a full page of links to stories published around the

world. Most reporters were accurate, they say, though they laugh about

how frequently Lawrence’s quotes were attributed to Giles and vice

versa. That’s surprising, even for a dual interview, because Steve

Lawrence’s speech is distinguished by a strong Australian accent.

Nevertheless Giles and Lawrence are alike in many ways:

They share a fervent commitment to basic research, finding out how

things work without needing to immediately apply the knowledge to

an actual project. They are among more than 100 people at 4 Independence

Way doing basic research at NEC Research Institute. Another 45 people

at NEC USA C&C Research Laboratories are focussed on applied research.

And three dozen employees do software development for UNIX products

in NEC’s Open Systems Technology Center, also at 4 Independence Way.

"In some sense, it’s our job, to do research and come up with

results," says Lawrence. "The atmosphere here is very conducive

to doing basic research."

"I wake up and say, Isn’t it great I am being paid to do


this?" says Giles. "I think we have fun here."

Giles and Lawrence Bios

Born in Memphis, Tennessee, C. Lee Giles stayed at home to attend

Rhodes College. He earned doctor’s degrees from the University of

Michigan and the University of Arizona. He worked for Ford Motor


Research in Dearborn, Michigan, taught electrical engineering and

computer engineering at Clarkson University, and then worked at the

Naval Research Laboratory in Washington, D.C. He is married and has a

daughter. His research areas include machine learning, optics, and

artificial intelligence neural networks.

Perhaps because Lawrence has considerable experience in theater and

improvisational comedy, he is the more talkative of the two. He grew

up in Queensland, Australia, where his father was an accountant, and

went to Queensland University of Technology, earning two undergraduate

degrees in 1993. After finishing his Ph.D. in Queensland, he came

to NEC Research Institute in 1996. "Lee was one of the major influences

that led me to pursue research as opposed to going into industry,"

says Lawrence.

Why did it take two artificial intelligence experts to do this study?

Why couldn’t two librarians or two statisticians do it just as well?

To do it efficiently Giles and Lawrence had to use NEC’s internal

search engine, Inquirus, which they had previously created. Over a

period of a year, on and off ("mostly off," says Giles) they

took queries that other NEC scientists had made, analyzed them, boiled

them down to 572 queries, wrote a program with restrictive parameters,

and fed it into Inquirus. Then from December 15 to 17 they ran the queries.

Among their parameters: count all documents returned, then count only

the documents that contain the exact query term (not a plural)

and remove duplicates. Only lowercase queries were used. They removed

pages with a "time-out" of more than 60 seconds. They limited

queries to those with 600 or fewer documents retrieved.
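
The counting rules above can be sketched as a small filter. This is a minimal illustration under stated assumptions, not the researchers' program: the `Hit` record, the function names, and the choice to treat two results as duplicates when their URLs match are all hypothetical, and the 600-document limit is assumed to apply to the raw result count for a query.

```python
import re
from dataclasses import dataclass

MAX_FETCH_SECONDS = 60    # pages slower than this are dropped
MAX_DOCS_PER_QUERY = 600  # queries returning more than this are excluded


@dataclass(frozen=True)
class Hit:
    """One retrieved document (hypothetical record layout)."""
    url: str
    text: str
    fetch_seconds: float


def contains_exact_term(text: str, term: str) -> bool:
    """True if `term` appears as a whole word, so 'crystal' skips 'crystals'."""
    return re.search(rf"\b{re.escape(term)}\b", text.lower()) is not None


def query_is_usable(raw_hits: list[Hit]) -> bool:
    """Keep only queries whose raw result count is within the limit."""
    return len(raw_hits) <= MAX_DOCS_PER_QUERY


def filter_hits(query: str, raw_hits: list[Hit]) -> list[Hit]:
    """Apply the time-out, exact-term, and duplicate rules to one query."""
    terms = query.lower().split()  # only lowercase queries were used
    seen_urls = set()
    kept = []
    for hit in raw_hits:
        if hit.fetch_seconds > MAX_FETCH_SECONDS:
            continue  # timed out
        if not all(contains_exact_term(hit.text, t) for t in terms):
            continue  # query term missing, or present only as a variant
        if hit.url in seen_urls:
            continue  # duplicate (assumption: same URL means same document)
        seen_urls.add(hit.url)
        kept.append(hit)
    return kept
```

Comparing `len(raw_hits)` with `len(filter_hits(query, raw_hits))` for each engine shows how much of a raw count survives once dead, slow, and repeated pages are removed.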

They manually checked that all results were retrieved from each engine

and were parsed correctly. "The engines periodically change their

formats for listing documents and for requesting the next page of

documents," says Giles.

Having made such a splash with this study, they have everything in

place to repeat the study to calculate the Web’s growth, but they

are already at work on a new project. They can’t talk about it yet,

except to say that it concerns a way to more efficiently disseminate

scientific information on the Web. Says Giles: "A very exciting

moment will come soon, in terms of impact of our research."

Tips on Web Searches

As experts in Web searches, they offer this advice for extracting

information from what amounts to a 15-billion-word encyclopedia.

For popular information Yahoo! can be useful. Searching

for NEC Research provides the link to the NEC Research Institute home page

as the only page returned.

For harder-to-find or more comprehensive and up-to-date

information,

use the major search engines: AltaVista, Excite, HotBot, Infoseek,

Lycos, and Northern Light.

Repeat the search on different engines or use a multiple

search engine such as MetaCrawler

to get 3.5 times as many documents. Some engines do not delete invalid

documents and so their results are false; some documents have been

changed to delete the term but may still be relevant.

Use research search engines

such as Google


and LASER

for improved ranking of

results. "They make greater use of the structure of Web pages

and the graph formed by the links between pages to determine page

relevancy," explains Lawrence. Google has an efficient ranking

algorithm called PageRank and uses the text found in links to a page

to describe the page. "The links often contain better descriptions

of the pages than the pages themselves," he explains.

Use more specific queries: keywords within the title or

URL, documents from a certain date range or geographic area, documents

with specific phrases rather than single terms, etc. Excite uses


clustering and Infoseek uses morphology; both will return documents

with related words. AltaVista returns only capitalized results for

capitalized queries.

Use specialized search engines

such as Wired Newsbot and

Excite Newstracker for news articles, OpenText for business sites,

and DejaNews for Internet discussion

groups. AHOY!, a specialized search for home pages, may be able to

find an unindexed home page by going first to a particular university

department and then locating the scientist’s page within that department.
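
The link-graph ranking mentioned in the tip on research search engines can be sketched in a few lines. The article only names PageRank; the version below is the commonly published textbook form, with a damping factor of 0.85 chosen as an assumption, and it should not be read as Google's actual implementation. It also leaves out the use of link text to describe pages.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Rank pages by repeatedly passing each page's score along its links.

    `links` maps a page to the pages it links to. The damping value and the
    iteration count are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: pages "a" and "b" both link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(toy_links).items()):
    print(page, round(score, 3))
```

In the toy graph the page with the most incoming links ends up with the highest score, which is the intuition behind ranking by link structure rather than by page text alone.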


Even more specific information can be found in the original paper

on Science magazine’s website. But,

due to a twist of irony, most of us will not find that paper

linked to any document produced by any search engine on the World

Wide Web. The American Association for the Advancement of Science

holds the copyright to the paper and has a six-month embargo policy,

so it is now available only to those who pay $100 for a subscription.

Forget "instant access." It’s back to the old-fashioned method

of corresponding with the scientist who did the work. At least the

mails are quicker. If you send an E-mail to either researcher,

you’ll get a reply by return E-mail

or fax.

— Barbara Fox

NEC Research Institute Inc., 4 Independence Way,

Princeton 08540. C. William Gear, president. 609-520-1555.

Taking Stock in E-Commerce

Two years after bailing out of his sinking


CD-ROM business, Thynx, Larry Shiller has cut his hair, literally,

and is trying to stick up for the little guys in the world of cyber

stock trading.

Shiller’s home-based company, SBX, has devised a Web-based system

that gives small brokers an affordable

way to facilitate buying and selling stocks online. Once a broker

is online, the broker’s customers can simply log in to SBX, with its

real-time stock feeds and information on a few thousand securities,

and buy and sell stocks directly from the site — without having

to pay broker fees or spreads (the difference between the bid and

ask prices).

"Internet trading is not just for discount brokerages


anymore," says Shiller. "The Internet is a way for brokers to improve their

communications with their customers. I believe the Internet will be

the primary means of communication between a brokerage and its customers

in the future, which means that small brokerages need to be on the Internet."

Because the cost of setting up the cyber-infrastructure is high,

running to hundreds of thousands, possibly millions, of dollars, small

broker/dealers are losing Internet-savvy customers to the large brokerage

houses like E-Trade or Schwab. "In terms of the service to private

Internet order entry, small brokers are at a significant disadvantage

because they can’t afford to build a secure website with all of the

firewalls that are required," says Shiller. "For the small

brokers who are seeing a flow of assets out of their accounts, SBX

is a way to offer online trading."

The SBXNet page informs investors that they can place Internet trades

with the broker of their choice — provided their broker is signed

up with SBX. "We have to get brokers to sign up — that’s the

business model," says Simon Blackwell of InfoFirst, the Research

Park-based firm that is managing SBX’s website and doing the marketing

and promotion.

"What we expect to gain from this is order flow," says


"which allows us then to be successful with our second product

line." This is a trade-matching system for over-the-counter bulletin

board stocks — typically stocks for companies that are too small

to be listed on any of the major boards like NASDAQ or the New York

Stock Exchange.

These stocks end up in the netherworld of bulletin board pink sheets,

where they typically have very low volumes and liquidity, and where

their performance is controlled by brokers known as "market makers."

"It’s very difficult to create a market for them because nobody

knows they’re there," says Blackwell. "There’s no reasonably

accessible information about the company or what price people are

willing to buy the stock for."

Shiller claims that SBX has the world’s most complete and accurate

up-to-date database of publicly reporting over-the-counter bulletin

board stocks. And, because the order book for the trades is open,

interested stock buyers can see the price histories of their


stocks. "They can see not only the inside market but the breadth

of the market," says Shiller.

The system also automatically matches the buyer and the seller. "What

we’re really doing is taking the role of the market maker and giving

it to the investor," says Shiller.
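
An open order book with automatic matching can be sketched as follows. The article does not describe SBX's matching rules, so this is a minimal illustration under assumptions: the `OrderBook` class and its methods are hypothetical, orders are matched by best price first and then by arrival order, each trade executes at the resting order's price, and prices are whole cents.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class OrderBook:
    """Toy limit-order book with price-time priority (an assumed convention)."""
    _bids: list = field(default_factory=list)  # entries: (-price, seq, qty)
    _asks: list = field(default_factory=list)  # entries: (price, seq, qty)
    _seq: count = field(default_factory=count)

    def submit(self, side: str, price: int, qty: int) -> list[tuple[int, int]]:
        """Add an order; return the (price, quantity) trades it triggers."""
        trades = []
        if side == "buy":
            # Fill against the cheapest asks at or below the buyer's limit.
            while qty > 0 and self._asks and self._asks[0][0] <= price:
                ask_price, seq, ask_qty = heapq.heappop(self._asks)
                fill = min(qty, ask_qty)
                trades.append((ask_price, fill))
                qty -= fill
                if ask_qty > fill:
                    heapq.heappush(self._asks, (ask_price, seq, ask_qty - fill))
            if qty > 0:  # the unfilled remainder rests in the book
                heapq.heappush(self._bids, (-price, next(self._seq), qty))
        elif side == "sell":
            # Fill against the highest bids at or above the seller's limit.
            while qty > 0 and self._bids and -self._bids[0][0] >= price:
                neg_bid, seq, bid_qty = heapq.heappop(self._bids)
                fill = min(qty, bid_qty)
                trades.append((-neg_bid, fill))
                qty -= fill
                if bid_qty > fill:
                    heapq.heappush(self._bids, (neg_bid, seq, bid_qty - fill))
            if qty > 0:
                heapq.heappush(self._asks, (price, next(self._seq), qty))
        else:
            raise ValueError("side must be 'buy' or 'sell'")
        return trades

    def best_bid_ask(self) -> tuple:
        """Highest bid and lowest ask; their difference is the spread."""
        bid = -self._bids[0][0] if self._bids else None
        ask = self._asks[0][0] if self._asks else None
        return bid, ask
```

A short usage example: after `book = OrderBook()`, calling `book.submit("sell", 125, 100)` rests an offer in the book, and a later `book.submit("buy", 130, 120)` fills against it without a market maker standing between the two investors.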

What Market Makers Say

What do market makers think of this? "Any time there

are systems put in place that potentially create more liquidity, that’s

great for everybody," says Douglas A. MacWright of FIA Capital,

the market maker for 1st Constitution Bank, the Route 130-based firm

that recently upped the number of its shares on the over-the-counter

bulletin board. "As a market maker I like to see a lot of activity

in stocks that I follow."

Shiller has a partner, K. Richard B. "Nick" Niehoff, president

of SBX. Known for his work automating the Cincinnati Stock Exchange,

Niehoff reports that SBX will also be able to do what no exchange

does right now — provide instantaneous and accurate quotes for

over-the-counter bulletin board stocks. "You can’t go to a


machine in a broker/dealer that is going to pop back a quote,"

says Niehoff. "This is not a firm quote market in most cases.

In the less-liquid areas of the market you don’t have this."

But first, SBX must cross an important hurdle: clearing its concept

with the Securities and Exchange Commission. Last September, the company

sent a "no action relief request" to the commission, asking

to be exempted from the same regulations that govern stock exchanges

and associations of brokers and dealers. Niehoff explains that although

SBX seems like an exchange, it is not an exchange, since it doesn’t

have a floor or any of the other amenities that real in-the-flesh

exchanges have, like market makers or screaming mobs or paper littering

the floor.

He hopes to hear back from the SEC next month but isn’t surprised

that it has taken this long. "This is cyberspace and it does have

to be reworked into the current rules and regulations that the SEC

has," says Niehoff. Or vice versa, perhaps.

From Cincinnati, Niehoff, 55, is a graduate of the Lawrenceville School

(Class of ’61) and of the University of Cincinnati. For 11 years he

was president of the Cincinnati Stock Exchange and founded its


securities trading system, one of the first electronic global stock

exchanges in the world. Niehoff also managed a U.S.


project to implement Poland’s first over-the-counter stock market.

Shiller and Niehoff Bios

While Niehoff handles the day-to-day operations, Shiller

is responsible for the company’s coffers and ideology. Shiller, 44,

grew up on Long Island, the son of a chemist and a registered nurse.

A prodigy with a penchant for music, math, and Wall Street, Shiller

toured the country and abroad as a concert violinist, then went to

MIT and got a BS in math (Class of 1975). For his first three years

out of college, Shiller worked as an engineer for Owens Illinois,

then as a research supervisor for Blue Cross, in Toledo. At night

he played with the Toledo Symphony.

In 1978 at age 24, he started his first business, an accounting


company and service bureau, based in Florida. In 1990 Shiller wrote

a book, "Towards Software Excellence" (Prentice Hall) that

details software analysis and design. He also attended the OPM


program at Harvard Business School in 1996.

He started his biggest venture to date, the Bureau of Electronic Publishing,

out of his garage in Verona in 1988. Selling entertainment CD-ROMs

with titles like "The Great Kat’s Digital Beethoven on


"Inside the White House," and the Weather Channel’s


Weather," the company started out strong and moved to Parsippany,

where it did a $5 million initial public offering in 1995. Then in

1996 it moved to 619 Alexander Road and changed its name to Thynx

shortly thereafter (U.S. 1, April 17, 1996).

But as the Internet emerged as the medium of the future, sales began

to dwindle and its stock price began to plummet. Shiller ended up

selling the company’s corporate shell to a group participating in

a joint venture in a lucrative Chinese polyester plant and got out

with a "decent valuation," he says (U.S. 1, January 8, 1997).

Shiller is married to Marcelle Soviero, a PR consultant, and the couple

is expecting their second child. He is still active musically, playing

for the Princeton Chamber Symphony and the Riverside Symphonia,

and is an occasional concertmaster for the Westminster Community Orchestra.

Through SBX, Shiller is satisfying two unfulfilled dreams. First,

he can at long last merge his entrepreneurial talents with the Internet,

the nemesis of his previous company. "From that experience it

was clear we needed to find a path to the Internet," he says.

Second, it allows him to toy around with his childhood hobby, Wall

Street, all the while taking a few hacks at some of its inequities.

"The first thing I learned about Wall Street is that it didn’t

work anything like the way I worked," he says. "Wall Street

appeared to be by brokers, for brokers. The investor often got


More of Shiller’s idealism is apparent in the firm’s phone

number, which reflects the year the nation’s first stock exchange (the

Philadelphia Stock Exchange) opened, in 1790. "After 200 odd years

there is technology that allows people to do what they never could

have dreamed of," says Shiller. "We’ve solved the problem

of low volume, low liquidity, and high spreads by bringing the free

market to Wall Street."

Now if only that SEC approval comes through.

— Peter J. Mladineo

SBX, Normandy Court, Larry Shiller. 609-466-4005.

InfoFirst, 14 Wall Street, Princeton 08540. Walter

Krieg, president. 609-683-3800; fax, 609-683-3802.

