
Search: the next generation

Published on 21 June 2008




By Chris Edwards

Next generation

The more we come to rely on search engines, the more we encounter their shortcomings. With public searchers like Google lurching toward ever-more commercial business models, should organisations now consider building their own search engines? We find out.

“Everyone thinks search is typing keywords into a box,” says Mike Lynch, the head of search specialist Autonomy. “But that is going to change.” The science of Web searching has reached a crucial stage in its development: for proof just Google ‘Web search future’. Lynch reckons that, as the mechanics of search head into the deepest recesses of the IT system, it will steadily reduce our dependence on Google as a primary Web inquisitor, and make that box fade from our memories.

Lynch proposes the idea of implicit search, where the computer analyses what users are doing, and then brings up relevant information. It is a future that search-engine pioneer Karen Spärck Jones pointed to in her final lecture before her death last year. Her work, which used statistics to give computers a better idea of how language works, led her to believe that there was another layer that could sit on top of those of the operating system, the utilities, and the applications: the information layer. This would use search technologies to make the underlying IT work more for the user than the reverse.

For Autonomy’s Lynch, meanwhile, the future of search lies in applications such as an early-warning system for trouble and a way of redressing the balance between people and their computers. “In the 1960s, people thought it would be good to get computers, but we had to make the world simple for the business – we had systems in which the position of a number told the computer what something meant,” he explains. “If the number in column three, for example, went down, the system could order more parts. But if it can understand the meaning of the human input, the computer can fit to how business is done.”

In this type of system, the servers might look at emails and messages coming in from customers through a helpdesk, and watch for changes in the type of problem. By determining how that traffic differs from the previous day, it may flag up that a bad batch of products has been delivered. “It might be a tiny percentage of traffic that is different,” Lynch says, “but it may contain harbingers of bigger things about to happen.”
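The day-on-day comparison described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not a description of how Autonomy's software works: the function names, the sample messages and the `min_ratio` and `floor` thresholds are all hypothetical, and messages are split into words on whitespace with no further language processing.

```python
from collections import Counter


def term_shares(messages):
    """Return each term's share of all term occurrences in a batch of messages."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def unusual_terms(yesterday, today, min_ratio=3.0, floor=0.01):
    """Flag terms whose share of today's traffic is at least min_ratio times
    their share yesterday. `floor` stands in for terms not seen yesterday
    (both thresholds are illustrative assumptions)."""
    before = term_shares(yesterday)
    now = term_shares(today)
    flagged = {}
    for term, share in now.items():
        ratio = share / max(before.get(term, 0.0), floor)
        if ratio >= min_ratio:
            flagged[term] = ratio
    return flagged


# Hypothetical helpdesk traffic for two consecutive days.
yesterday = [
    "printer will not start",
    "printer paper jam",
    "how do I reset my password",
]
today = [
    "printer will not start",
    "battery overheating badly",
    "battery overheating after charge",
    "printer paper jam",
]

flagged = unusual_terms(yesterday, today)
print(sorted(flagged))
```

Routine complaints about printers appear on both days and are ignored, while the words that are new today are flagged for a human to look at.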

Others are more sceptical of what implicit search might do. “I suspect that implicit search may madden people more than it delights them,” says Stephen Robertson of Microsoft Labs in Cambridge, who worked with Spärck Jones on the techniques that still underpin many search engines.

“Understanding what defines the current context of your task is quite hard; although I am sure there are circumstances in which it might be useful. One colleague installed an experimental system a few years ago that used implicit search. Shortly after installing it, he was replying to an email, and it brought up the reply he had sent and forgotten about.”

But, Robertson maintains, the situation could be like autocorrection in word processors: “Half of the time, I think it is great – and half of the time I think: ‘Why did it do that?’”

Damned statistics

Search is poised to potentially change the way that computers operate, where work on language processing in other areas seems to have failed. It’s all in the mechanics of the search engine. The surprising thing about search technology is that it doesn’t need to understand language. Trying to get information retrieval systems to deal with grammar and structure has sent many of them up blind alleys. Very early work on search technology led people to try to get computers to understand grammar.

“In reality, that has been a commercial failure,” Lynch admits. Why? “It’s because the world is very complicated.”

Time and again, statistical models that do not even try to parse grammar have turned out to be far more successful. Robertson says: “One of the success stories in the last ten years has been the statistical language models. They don’t try to address syntax or levels of structure; but they do reveal a level of organisation that is in the statistics.”

If actions based on implicit search do become a reality, Robertson reckons statistics will play a key role: “There are certainly opportunities for systems that do implicit things on the basis of what they are observing. I am sure that statistics are vitally important in that, in the same way that statistics have become so important in our understanding of language. The people who worked on formal grammars would find it quite extraordinary.”

Robertson cannot see any kind of implicit action working without any statistical ideas of what is normal and what is abnormal: “It will come from a combination of logic and statistics,” he says.

Search contenders

The other reason for much of the work undertaken by the search industry is starkly commercial: these are the applications that are getting the R&D money right now. The advertising machine behind the public search engines is helping it to fund a battery of projects, many of which use some kind of language processing to serve up even more ads to the consumer. Google tries to show you ads that are relevant to emails or documents that you view in a browser.

However, individual companies are themselves getting in on the search game – and this may provide Google’s biggest competition. Enterprise search has got off to a slow start. It means there are thousands of documents lying inside the corporate firewall that people cannot find because they do not have an effective internal search facility. Says Charlie Hull of Lemur Consulting: “People are asking: ‘If the internet knows, then why don’t we?’”

So, companies are now installing search engines – which are providing the money for the next phase of development. The technology in Microsoft’s SharePoint, for example, underpins the personal search engine in the company’s Vista operating system. Apple Computer has made its own investment in the Spotlight engine. And then, at the intranet level, the enterprise search vendors are moving in. Within the enterprise, the search box still rules but, with the ability to deploy software agents to servers and desktops, the idea of rolling implicit search into the machine becomes more feasible.

For a provider like Google to do it, you have to agree to do everything on its servers, and give the company access to what might be sensitive material. Therefore, the things that drive enterprise search are privacy and security, coupled with an incredible volume of documents.

Going public

Lemur’s Charlie Hull says: “People are generating documents at such a rate that, if you don’t have millions of documents today, you will have tomorrow.” Hull claims that people are now so used to using services such as Google to find things, they expect to be able to do the same with the information held by their own employers. And they often uncover ways to find information from public rather than private sources.

“In the old days if you wanted to look at a phone number you looked at your own phone list,” says Hull, “but if your own system is slow and painful, you will use Google.”

Enterprise search is about reversing that trend. To get information like phone numbers into a search engine, the software needs to understand all manner of different data formats. Companies use many different file formats – some of them custom – and they want to plug search into email and messaging. Enterprise search engines need to be able to handle customisation.

“A typical large company has of the order of 9,000 information repositories,” opines Lynch. “In there, you have some 400 different types of repository, and a thousand file types.”

The multiplicity of data formats means that you may have to pay the search-engine supplier to develop the code needed to index the information in your custom files and databases, or use an application programming interface (API) to do the job yourself. This is where people such as Hull believe open source could have an advantage.

Potentially, you can build much tighter links between your data and the search engine and also benefit from the converters that people have developed for various data formats that have been passed back into the community. Lynch sees an opening for open source among smaller companies “provided you like programming”.

A common theme among enterprise search users is the need for security. Most enterprise search tools on the market are able to hide documents from users who do not have the right access privileges. This is not something you can get from public search engines.
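As a rough illustration of how results can be hidden from users without the right privileges, the sketch below filters a list of hits against the groups a user belongs to. It is a minimal example under stated assumptions, not the mechanism of any particular product: the `Document` class, the `allowed_groups` field and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A search hit plus the groups allowed to see it (hypothetical model)."""
    title: str
    allowed_groups: frozenset


def visible_results(results, user_groups):
    """Keep only the hits that share at least one group with the user,
    preserving the original ranking order."""
    groups = frozenset(user_groups)
    return [doc for doc in results if doc.allowed_groups & groups]


hits = [
    Document("Staff handbook", frozenset({"all-staff"})),
    Document("Merger plan", frozenset({"board"})),
    Document("Build server guide", frozenset({"engineering", "it"})),
]

shown = visible_results(hits, {"all-staff", "engineering"})
print([doc.title for doc in shown])
```

An engineer sees the handbook and the build guide, while the board-only document never appears in the results list at all.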

A further incentive for getting users onto the internal search system comes from the evolution of enterprise search engines to support federated search. Users often call for the ability to search both inside and outside the company. This entails performing federated searches or using metasearch engines – tools that collate results from many different search products.

Federated search may make the public search engines less important, because it allows search engines to cooperate on a task. The problem here is one of standardisation: that is moving slowly, but there is an active specification, called OpenSearch. If lots of little search engines are able to work together, users may slip away from the monolithic services provided by Ask, Google, and Microsoft.
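One simple way a metasearch tool might collate results is to interleave the ranked lists it gets back from each engine and drop duplicates. The sketch below assumes each engine has already returned an ordered list of URLs; it does not use OpenSearch or contact any real service, and the function name and sample URLs are made up for illustration.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked lists round-robin, keeping the first occurrence
    of each URL and skipping later repeats."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            # zip_longest pads the shorter lists with None.
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical results from an internal engine and an external one.
internal = ["intranet/phone-list", "intranet/hr-policy"]
external = ["example.com/directory", "intranet/phone-list", "example.com/news"]

combined = merge_results(internal, external)
print(combined)
```

Round-robin interleaving treats every engine as equally trustworthy; a real deployment would probably weight the sources or compare their relevance scores instead.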



  • Case studies: search in use
    For Jane Bradbury, the knowledge management director at legal firm Field Fisher Waterhouse, enterprise search has taken over from the knowledge-management tools she used to work with. “When I started, I did an audit of how we use knowledge within the business. As I started to look at traditional knowledge-management systems, I then stumbled across enterprise search. We were very much starting from scratch and there were no installed knowledge-management systems.

    “I was impressed with the advances made in the software,” she says. “Using search, we wouldn't have to employ specialists; so, we are using it as our knowledge-management tool. It puts knowledge management in the context of the business.”

    The legal community in general has become one of the key markets for enterprise-search vendors because the business of law depends so heavily on stored information: case law, precedents and government policy all have to be taken into account by a lawyer working on a case. In turn, it is influencing how search engines are deployed – it is important to search not only within the company but in external databases as well.

    Using the Recommind search engine, Bradbury said the company is moving to external searches gradually. “We want to introduce that quite slowly: we wanted to do something that would make a difference and… work,” she notes. However, the system, which was deployed in three months in the summer of 2007, can find information in one of the third-party legal databases that the firm licenses.

    Charlie Hull, of search-engine specialist Lemur Consulting, says some legal firms have had to hook their internal search software into some 15 different legal databases. “In the law sector, we often get asked to build web spiders and create metasearch software,” he claims.

    Thierry Menard, knowledge management manager at Bureau Veritas, says his company needed a federated search engine, in this case Polyspot Enterprise Search, to handle the many different sources and technologies that the indexer has to reach into. And the certification-service provider will add the ability to search external sites next year.

    At the Scottish Government, a single engine from Exalead underpins the search on both the internal and publicly-accessible websites. “There are five sites in all,” says Doug Campbell, e-search project manager for the Scottish Government.

    On the intranet alone, Campbell adds: “There is a tremendous amount of stuff. And we wanted to provide the sophisticated search functions and features that people are now used to on the external websites. The product works: you put in one or two terms and it is pretty good at coming up with relevant terms.”

  • Hooks: search inwardly, search deep
    Looking at the number of researchers working at Google, it is easy to believe that public search engines have the upper hand in terms of technology.

    They do indeed have some key advantages. Enterprise search has several problems to overcome but, in dealing with those obstacles, the technologies could become as powerful as those used in public search engines or even surpass them. The biggest obstacle to enterprise search is the lack of scale.

    “The chaos of the Web makes for ways in which Web retrieval is easier. It is the result of the behaviour of lots of people,” Stephen Robertson of Microsoft Labs in Cambridge explains. “In more controlled environments it is more difficult because the numbers are not so large; and the control that people try to exercise is sometimes counterproductive.”

    To deal with the scale of the Internet, the public search engines rely heavily on links not just to crawl through the billions of documents, but also to index them. “One of the most powerful clues for the search engine is anchor text,” Robertson says.

    The hyperlinked text in a document provides the search engine with a very good clue of what the target document is about. It has famously been defeated in Google-bombing campaigns, where people team up to put misleading text in the anchor – you can still find plenty of references to the attempt to link George W Bush to the text ‘miserable failure’, although Google has dealt with the original problem. Spam also attempts to subvert search engine behaviour through anchor text. But, for the time being, public search engines can make use of it.

    Inside the enterprise, useful anchor text is hard to find: it only exists in any quantity on intranet pages. Enterprise search engines rely much more on the text inside the files themselves, which complicates the job of finding relevant documents to a query; but work on statistical models is improving the hit rate of enterprise-search engines to the point that it becomes counterproductive to use links provided by a content-management system or user-defined tags to try to provide hints to the indexer.
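The anchor-text clue that Robertson describes can be shown with a toy index that files each target page under the words other pages use when linking to it. This is a bare sketch of the idea, not how any public search engine is built: the link data and function names are hypothetical, and there is no ranking or defence against the manipulation mentioned above.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each word of anchor text to the set of pages that text points at.

    `links` is an iterable of (source_page, anchor_text, target_page) tuples.
    """
    index = defaultdict(set)
    for _source, anchor_text, target in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return the pages whose incoming anchor text contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical links found while crawling.
links = [
    ("a.example/home", "expenses policy", "intranet/finance/exp.html"),
    ("b.example/faq", "travel expenses", "intranet/finance/exp.html"),
    ("c.example/staff", "holiday policy", "intranet/hr/leave.html"),
]

anchor_index = build_anchor_index(links)
print(search(anchor_index, "expenses policy"))
```

The target page is found even though the index never looked at its own contents, which is why the technique depends on having plenty of links to work with.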

  • Jargon buster: Kamau Austin’s search engine terminology guide

    Best Practice SEO AKA ‘white hat’ SEO, this is studying the search engine algorithms, and using accepted and standard SEO practices to create outstanding sites with relevant content, to boost the credibility of the search engines.

    Internet directory A large listing of categorised websites – however, the concept that should be understood with the Internet directories is that they actually have editors that decide what goes into the directory.

    Link popularity Usually referred to as the number of links that point to your site from other sites (or inbound links).

    Off-the-page factors Optimisation techniques like signing-up for links from Internet Directories, reciprocal linking campaigns, writing articles with active inbound links, buying links from sites with a high PageRank from Google, or sending press releases with Online Wire Services with good link popularity to boost your website’s search engine rankings.

    On-the-page factors Things like the SEO copy-editing, interface, and Web design, HTML coding and site layout, linking structure, or anything that you do on your actual site to make it more search engine friendly for higher rankings.

    PageRank Google’s off-the-page algorithm for determining the link popularity and overall rank of your website.

    Pay-per-inclusion Paying to be included in the database of a search engine or Internet directory.

    Pay-per-click Advertising with the search engines is bidding for particular keyword phrases or search terms used most frequently by Internet users related to information on certain niches and sectors.

    Reciprocal linking The practice of websites exchanging links to increase each other’s link popularity, and for the more sophisticated Web marketers, their PageRank.

    Search Engine Database of websites that is ranked according to the computerised criteria that the programmers decide upon called an algorithm. Various search engines determine ranking on their own different factors of importance or relevance.

    Search engine algorithm Sets of rules according to which search engines rank Web pages. Figuring out the algorithms is a major part of SEO. The thinking is that if you understand how they calculate relevance, you can make specific pages on your site super-relevant for specific search terms.

    Search engine copy-editing Merges the traditional sales copyediting skills of direct mail marketing, and the ‘on the page factors’ of writing in a way that concurrently gets good Search Engine rankings.

    Search engine indexing The process of when a search engine sends out an Internet spider to follow and then download the new pages of the Internet into its index. Basically a database of the text content and links of your site is recorded by the search engine or indexed. In other words when your site is indexed by a Search Engine it is catalogued for retrieval by Internet searchers.

    SEO (search engine optimisation) SEO promotion is using known conventions and, in some cases, deconstructing the algorithms of the developers of the search engines and working with them.

    Search engine ranking The position of your site in the search results of the search engines for a particular keyword phrase or search term. Ranking results are shown on the SERPs in descending order from the #1 ranked site on down.

    Search engine results page (SERP) The search results pages that you receive after you type your keyword phrases or search terms into the search box of the search engine. The first SERP usually gives you the results ranked according to the engine’s algorithm and also the number of sites returned from your search.

    Search engine spam AKA ‘black hat’ SEO is using tactics that actually hurt the relevancy of the search engines to achieve high-ranking websites. This includes tactics like cloaking, hidden text, doorway pages, keyword stuffing, or redirects to erode the efficiency of the search engines to deliver good results to their end users searching for information.

    Search engine spider An automated program that forms part of a search engine. Its task is to surf the Web by following links from one page to the next and from one site to the next. It collects information from the sites it visits and that information is stored in the search engine’s database.

    Search engine submission The act of signing up to get listed in the search engines. This is a time consuming task that should be done to get listed at some of the search engines or Internet directories.

    Search terms (AKA keywords or keyword phrases) Search terms (and more specifically keyword phrases) are words searchers put in a search box to find information on a particular product, service, or item. Keywords and keyword phrases have different tiers.

    • Kamau Austin is the author of the book ‘Always On Top’ (AdPro Media Sales Publishing), and the owner of www.searchengineplan.com.