Dick Stenmark, PhD
Göteborg University
Department of Informatics
P.O.Box 620
S-40530 Göteborg
Sweden
stenmark@informatik.gu.se
The goal of this test was to compare the relative quality of the top 10 documents returned by each of two search engines: the search engine shipped with Microsoft's SharePoint Portal Server 2003 (hereafter referred to as SPPS) and Verity's Ultraseek Server 5.1 (hereafter referred to as Ultraseek). One of the central problems a search engine must address is not to return as many documents as possible but to sort out the relevant documents and display them at the top of the result list. SPPS's ranking algorithm is based on probabilistic relevance scoring (PRS), a technique developed in the 1970s by Robertson and Sparck Jones, two researchers at City University, London [1]. Although he maintains a part-time professorship, Robertson has been a Microsoft Research employee since the late 1990s. The idea behind PRS is to weight terms differently in order to differentiate between individual documents in the search set, and thereby to identify the few truly relevant documents among the many non-relevant ones that still contain some or all of the terms in the user's query. The terms used for weighting in SPPS are collection frequency, term frequency, document length and position within the document [2].
According to Microsoft, one of the benefits of the Okapi algorithm is its extensibility, i.e., the fact that each weighted term is evaluated separately from the others when a document is analysed. This makes it possible to extend the algorithm with new weighted terms based on other kinds of information, for example, the time the document was created or last modified. However, this modification is not something the user or the administrator can do. Such modifications must be made at product level by Microsoft and are therefore not something with which the customer can experiment. We have no explicit information about the ranking algorithm used by Ultraseek, but since collection frequency and term frequency are standard in most information retrieval systems, we assume they are used by Ultraseek. Ultraseek also weights term position and, unlike SPPS, it allows the system administrator to select terms and weights interactively. This means that the customer can choose to favour documents where the sought-for keyword is found in the title field. Whether or not Ultraseek factors in document length remains unknown.
Our objective was to test whether there was any significant difference between the two products' relevance ranking.
This test was carried out at Volvo Information Technology in Gothenburg, Sweden. Volvo has had Ultraseek installed since 1998 and there were approximately 750,000 documents in its index. SPPS was installed for the test and had, at the time of comparison, indexed slightly fewer than 1,000,000 documents. It should be noted that although the two engines had not indexed exactly the same content, the intention was to cover the entire intranet, and both engines had been set up to crawl as much as they could.
The methodology used for this test is based on Veritest's approach as described in their March 2003 evaluation of five public search engines for the Web [3]. To test the relevance of the ranking algorithms, a set of query terms was needed. We therefore examined the search logs from the existing Ultraseek implementation and collected the hundred most frequently used query terms, i.e., queries submitted by actual Volvo employees. Amongst those, we randomly selected 20 terms. Starting with the first, we submitted it to each of the two search engines, and if more than 10 documents were returned, the query term was selected for the test. If 10 documents or fewer were returned, the term was discarded. This process was repeated until 10 query terms had been selected. The terms used are listed in Table A below.
Table A. The query terms used in this study, how they should be interpreted, their original position in the Top-100 frequency list, and their absolute frequency based on data from 1 Nov 2003 - 7 Feb 2004.
| Query term(s) | Comment | Position | Frequency |
| --- | --- | --- | --- |
| parma | Find information about the customer and supplier system "Parma" | 7 | 894 |
| visitkort | Find out how to order business cards | 16 | 447 |
| umeå | Find information about Volvo facilities in Umeå | 26 | 387 |
| bilbiten | Find service and contact information for Bilbiten | 34 | 304 |
| qulis | Find information on the "Qulis" quality system | 45 | 263 |
| forms | Find online forms | 50 | 243 |
| change password | Find out how to change and manage passwords | 81 | 168 |
| motiv | Find information on Volvo's chemicals database "Motiv" | 95 | 158 |
| web access | Find out how to access Outlook via a web browser | 99 | 153 |
| terminalglasögon | Find regulations concerning terminal spectacles | 100 | 152 |
Comments: In Table A we give the query terms exactly as they were submitted. Visitkort is the Swedish word for business card. Bilbiten is the name of a set of five Volvo service facilities within the Göteborg area, where you can refuel, get your car serviced, or purchase Volvo accessories. Terminalglasögon means a pair of spectacles optimised for terminal reading. Volvo employees in Sweden in need of such spectacles may purchase them at a subsidised cost.
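The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `hit_count` is a hypothetical callable standing in for a query against a search engine, the engine labels are placeholders, and we assume that a term had to return more than 10 documents from both engines to qualify.

```python
import random

ENGINES = ("ultraseek", "spps")  # placeholder labels, not real API names


def select_query_terms(top_terms, hit_count, sample_size=20, wanted=10,
                       min_hits=10, seed=None):
    """Sample candidate terms and keep those with enough hits on every engine.

    top_terms : the most frequent query terms taken from the search log
    hit_count : hypothetical callable (engine, term) -> number of documents
    """
    rng = random.Random(seed)
    candidates = rng.sample(top_terms, sample_size)  # random draw, no repeats
    selected = []
    for term in candidates:
        # Assumption: the term must return MORE than min_hits on both engines.
        if all(hit_count(engine, term) > min_hits for engine in ENGINES):
            selected.append(term)
        if len(selected) == wanted:
            break
    return selected


if __name__ == "__main__":
    # Made-up data: 100 dummy terms that all return 25 documents.
    dummy_terms = [f"term{i}" for i in range(100)]
    print(select_query_terms(dummy_terms, lambda engine, term: 25, seed=1))
```

If fewer than `wanted` terms survive the filter, the sketch simply returns the shorter list; the text does not say what was done in that case.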
These 10 queries were then submitted again to both search engines and the resulting top-10 documents from each search engine were collected, which meant that we had 200 documents (or URLs), 100 from each search engine. These URLs were distributed to 20 independent testers who each evaluated 10 URLs. The distribution of the URLs was such that each tester had five URLs from each search engine and only one URL per query term. The testers could not determine which URL came from which search engine, nor could they determine the original ranking of any particular URL. Their job was only to judge whether a given document (URL) was relevant given the particular query. To help them determine the relevance of a particular URL, each tester was given the following written questions to ask themselves when making their judgment:
If any of these questions could be answered affirmatively, the document was to be judged relevant [3].
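The text does not say how the 200 URLs were actually allotted to the testers. One possible scheme that satisfies the stated constraints (10 URLs per tester, one per query term, five from each engine, every URL judged exactly once) is the rotation sketched below. The function and constant names are our own, and the scheme is an assumption, not the procedure used in the study.

```python
from collections import Counter

ENGINES = ("Ultraseek", "SPPS")
N_QUERIES = 10
N_RANKS = 10
N_TESTERS = len(ENGINES) * N_RANKS  # 20 testers, one per (engine, rank) slot


def assign_urls():
    """Return {tester: [(query, engine, rank), ...]} using a simple rotation.

    For each query the 20 (engine, rank) slots are handed out to the 20
    testers, shifted by one tester per query, so that every tester sees
    ten consecutive slots and therefore alternates between the engines.
    """
    assignment = {t: [] for t in range(N_TESTERS)}
    for q in range(N_QUERIES):
        for t in range(N_TESTERS):
            slot = (t + q) % N_TESTERS
            engine = ENGINES[slot % 2]   # even slots -> first engine
            rank = slot // 2 + 1         # ranks 1..10
            assignment[t].append((q, engine, rank))
    return assignment


if __name__ == "__main__":
    plan = assign_urls()
    print(plan[0])
    print(Counter(engine for _, engine, _ in plan[0]))
```

In practice the engine and rank would be stripped from each tester's list, and the order shuffled, before the URLs were handed over, since the testers were not supposed to know either.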
Once the judging process was complete and the evaluation protocols had been collected, we calculated the raw score and three weighted scores to measure the relevancy of each search engine's ranking. The raw score measures the proportion of relevant documents amongst the top-10 documents and produces a value between 0 (all irrelevant) and 1 (all relevant). The three weighted scores were used to take the position of each relevant document into account. Most users would agree that five relevant documents in positions 1-5 are better than five relevant documents in positions 6-10, even though the raw precision is 50% in both cases. We first used the 1/r weight, where r is the rank position. A relevant document in position 4 thus gets 1/4 (0.25) points, whereas a relevant document in position 7 gets only 1/7 (0.14) points. With this weighting, scores range between 0 and 2.93 (the sum of 1/r for r = 1 to 10).
The second weight used was 1/SQRT(r), which gives a flatter scoring curve, i.e., the difference between position 1 and position 10 is not as large as with the 1/r weight. A relevant document in position 4 gets 0.50 points and a relevant document in position 7 receives 0.38. The score may thus vary between 0 and 5.02.
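As a minimal sketch, the raw score and the two rank-weighted scores can be computed as follows for a single query. The list of judgments is hypothetical, and the function names are ours.

```python
import math


def raw_precision(relevant):
    """Fraction of the judged results that are relevant (0.0 to 1.0)."""
    return sum(relevant) / len(relevant)


def weighted_score(relevant, weight):
    """Sum weight(r) over every 1-based rank r that holds a relevant document."""
    return sum(weight(r) for r, is_rel in enumerate(relevant, start=1) if is_rel)


def inv_rank(r):
    return 1.0 / r


def inv_sqrt_rank(r):
    return 1.0 / math.sqrt(r)


# Hypothetical judgments for one query: relevant at ranks 1, 4 and 7.
judgments = [True, False, False, True, False, False, True, False, False, False]

print(raw_precision(judgments))                            # 3 of 10 relevant
print(round(weighted_score(judgments, inv_rank), 2))       # 1 + 0.25 + 0.14
print(round(weighted_score(judgments, inv_sqrt_rank), 2))  # 1 + 0.50 + 0.38

# Upper bounds quoted in the text: all ten results relevant.
print(round(weighted_score([True] * 10, inv_rank), 2))       # 2.93
print(round(weighted_score([True] * 10, inv_sqrt_rank), 2))  # 5.02
```

The text does not spell out how the per-query scores were combined into the overall figures; averaging over the ten queries is one reading that is consistent with the stated maxima.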
Finally, we used the average Document Cut-off Value (DCV), which is a method proven useful in academic studies. The method basically means that you calculate the raw relevance at a certain cut-off value c and then repeat this for a number of different c. Finally, you calculate the average relevance for the whole test. For details, see Hull's paper from 1993 [4]. Scores go between 0 and 1.
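The averaging over cut-off values can be illustrated with the short sketch below. The text does not state which cut-off values were used, so the choice of c = 1, 2, ..., 10 is an assumption, as are the function names and the example judgments.

```python
def precision_at(relevant, c):
    """Raw precision over the first c results."""
    return sum(relevant[:c]) / c


def average_dcv(relevant, cutoffs=range(1, 11)):
    """Mean of the raw precision taken at each cut-off value c.

    Assumption: cut-offs 1..10; the study may have used a different set.
    """
    cutoffs = list(cutoffs)
    return sum(precision_at(relevant, c) for c in cutoffs) / len(cutoffs)


# Two hypothetical result lists with the same raw precision (50%).
top_heavy = [True] * 5 + [False] * 5      # relevant documents at ranks 1-5
bottom_heavy = [False] * 5 + [True] * 5   # relevant documents at ranks 6-10

print(round(average_dcv(top_heavy), 2))     # 0.82
print(round(average_dcv(bottom_heavy), 2))  # 0.18
```

Like the two rank weights, the averaged score rewards relevant documents that appear early, while staying within the range 0 to 1.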
The result was consistent across all scoring methods, indicating a stable result. Regardless of scoring method, the results from Ultraseek were found to be about twice as relevant as those from SPPS. Of the 100 results reviewed from Ultraseek, 54 were considered relevant, whereas only 26 of the SPPS results were judged relevant. This is a significant difference (Ultraseek 108% better), as illustrated in Figure A.

Figure A. Raw precision calculated over 100 results (Max=1.0)
If we split the raw results query by query, we arrive at the diagram in Figure B. We see that although Ultraseek dominates the overall result, some queries (such as #4 (change password) and #8 (forms)) are better answered by SPPS.

Figure B. Raw precision based on 10 results per query
The use of the 1/r weight produced a similar diagram. Ultraseek received a total of 1.73 points whereas SPPS ended up with 0.83 points out of a maximum of 2.93 (Ultraseek 108% better, see Figure C).

Figure C. Overall weighted (1/r) results for 100 documents (Max=2.93)
When analysing the results by query, as illustrated in Figure D, we see that this time Ultraseek outscores SPPS on queries #4 and #8 as well. This means that although SPPS returned more relevant answers for these queries (illustrated in Figure B), Ultraseek's relevant answers showed up higher in the result list.

Figure D. Weighted (1/r) result per query
Applying the 1/SQRT(r) weight changes only the absolute points; the relationship remains the same as for the 1/r weight, as can be seen in Figure E. The scores were 2.82 and 1.38 (out of 5.02) for Ultraseek and SPPS, respectively (Ultraseek 104% better).

Figure E. Overall weighted (1/SQRT(r)) results.
The score per query is shown in Figure F below. Note that the results for queries #4 and #8 have shifted again, illustrating the importance of position weighting. The 1/SQRT(r) weight penalises low-ranked documents less than the 1/r weight does.

Figure F. Weighted (1/SQRT(r)) per query
The average DCV completed the test and confirmed the above pattern. With a score of 0.58, Ultraseek again beats SPPS, which scored 0.30 (Ultraseek 93% better). As seen in Figure G, the proportions are very stable, albeit somewhat less pronounced.

Figure G. Overall DCV results
The average DCV per query, finally, is in Figure H.

Figure H. DCV per query
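The "percent better" figures quoted for the four scoring methods can be re-derived from the reported totals as a quick arithmetic check. The dictionary below only restates numbers given in the text; the helper name is ours.

```python
# (Ultraseek, SPPS) totals as reported for each scoring method.
reported = {
    "raw (relevant out of 100)": (54, 26),
    "1/r": (1.73, 0.83),
    "1/SQRT(r)": (2.82, 1.38),
    "average DCV": (0.58, 0.30),
}


def percent_better(a, b):
    """Relative improvement of a over b, rounded to a whole percent."""
    return round((a / b - 1) * 100)


improvements = {name: percent_better(u, s) for name, (u, s) in reported.items()}

for name, pct in improvements.items():
    print(f"{name}: Ultraseek {pct}% better")
```

The computed values are 108, 108, 104 and 93 percent, matching the figures stated for each method.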
A practical test of the relevancy of Ultraseek and SPPS was carried out during February 2004 by twenty independent testers at Volvo IT. The test showed that Ultraseek outperformed SPPS regardless of whether raw scoring, 1/r scoring, 1/sqrt(r) scoring, or average DCV scoring was used. Since the results were consistent across scoring algorithms and the difference was significant (Ultraseek results twice as good), our conclusion must be that SPPS fails to improve result ranking.
[1] Robertson, S. and Sparck Jones, K. (1997), "Simple, proven approaches to text retrieval", Tech. Rep. TR356, Cambridge University Computer Laboratory. Available on the web at http://www.cl.cam.ac.uk/Research/Reports/TR356-ksj-approaches-to-text-retrieval.html [October 2004]
[2] Microsoft (2002), "Microsoft SharePoint Portal Server: Advanced Technologies for Information Search and Retrieval", Microsoft White Paper, June 2002. Available on the web at http://www.microsoft.com/sharepoint/server/techinfo/administration/search.asp [October 2004]
[3] Veritest (2003), "Inktomi Corp: Web Search Relevance Test". Available on the web at http://www.veritest.com/clients/reports/inktomi/inktomi_web_search_test.pdf. [October 2004]
[4] Hull, D. (1993), "Using Statistical Testing in the Evaluation of Retrieval Experiments", in Proceedings of SIGIR '93, ACM Press: Pittsburgh, PA, pp 329-338. Available on the web at http://portal.acm.org/citation.cfm?id=160758&dl=ACM&coll=GUIDE [October 2004]
© Dick Stenmark 2004.