Search Tool Data Analysis

by Roopak Pati (roopakroopak in BIT330, Fall 2008)

Questions and queries

Web search engines

How much oil does the U.S. consume monthly on average? The objective is to get a general idea of this figure rather than a specific number for a particular month. Data that would allow monthly consumption to be calculated (e.g., daily consumption) would also be helpful.

Google Query: U.S. monthly oil consumption
Yahoo Query: U.S. monthly oil consumption
Live Query: U.S. monthly oil consumption

Blog search engines

Who supplies oil to the U.S.? The objective of this question is simply to find countries that supply oil to the U.S. Any source that names at least one such country is considered useful.

Bloglines Query: U.S. oil supplier
Google Blog Query: U.S. oil supplier
Technorati Query: U.S. oil supplier

Data that I collected

Search engine overlap data

Web search    Live   Google   Yahoo Web
Live          60     30       35
Google               40       35
Yahoo Web                     50
All           20

Blog search   Technorati   Google Blog   Bloglines
Technorati    10           0             0
Google Blog                60            5
Bloglines                                55
All           0
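
The pairwise and three-way overlap figures in tables like these can be computed mechanically. The following is a minimal sketch, assuming each engine's results are held as a set of URLs and that overlap is expressed as a percentage of the n results examined per engine. The function name `overlap_percentages`, the value of n, and the sample URLs are hypothetical and are not taken from the assignment.

```python
from itertools import combinations


def overlap_percentages(results, n):
    """Return the percent of n results shared by every group of two or more engines.

    results: dict mapping an engine name to a set of result URLs (hypothetical layout).
    n: number of results examined per engine (an assumption, not stated on this page).
    """
    table = {}
    names = sorted(results)
    for size in range(2, len(names) + 1):
        for group in combinations(names, size):
            # URLs that appear in the result set of every engine in the group
            shared = set.intersection(*(results[name] for name in group))
            table[group] = 100.0 * len(shared) / n
    return table


# Made-up URLs, only to show the shape of the input and output
sample = {
    "Live": {"a", "b", "c", "d"},
    "Google": {"b", "c", "d", "e"},
    "Yahoo": {"c", "d", "e", "f"},
}
for group, pct in overlap_percentages(sample, 4).items():
    print("/".join(group), pct)
```

The keys of the returned dictionary are tuples of engine names in alphabetical order, so the three-way entry plays the role of the "All" row.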

Search engine ranking overlap data

This table provides a measure of how many of Google's results are reproduced by Yahoo.
GY           Yahoo 5   Yahoo 10   Yahoo 20
Google 5     0         0          1
Google 10    0         1          2
Google 20    2         4          7
This table provides a measure of how many of Yahoo's results are reproduced by Google.
YG           Google 5   Google 10   Google 20
Yahoo 5      0          0           2
Yahoo 10     0          1           4
Yahoo 20     1          2           7
This table provides a measure of how many of Bloglines' results are reproduced by Google Blog Search.
BG              Google 5   Google 10   Google 20
Bloglines 5     0          1           1
Bloglines 10    0          1           1
Bloglines 20    0          1           1
This table provides a measure of how many of Google Blog Search's results are reproduced by Bloglines.
GB          Bloglines 5   Bloglines 10   Bloglines 20
GBlog 5     0             0              0
GBlog 10    1             1              1
GBlog 20    1             1              1
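
A quick consistency check on these four tables can be sketched as follows. It assumes that each cell counts the results shared by the top a results of the row engine and the top b results of the column engine, which matches the o(a,b) definition given later on this page. Under that reading, swapping the two engines should simply transpose the table, and a count should never shrink as either cutoff grows. The helper names `transpose` and `is_monotone` are illustrative only.

```python
# Counts copied from the four individual ranking-overlap tables.
# Rows are the row engine's top 5, 10, 20; columns are the column engine's top 5, 10, 20.
GY = [[0, 0, 1], [0, 1, 2], [2, 4, 7]]
YG = [[0, 0, 2], [0, 1, 4], [1, 2, 7]]
BG = [[0, 1, 1], [0, 1, 1], [0, 1, 1]]
GB = [[0, 0, 0], [1, 1, 1], [1, 1, 1]]


def transpose(grid):
    """Swap rows and columns of a rectangular grid."""
    return [list(column) for column in zip(*grid)]


def is_monotone(grid):
    """True if counts never decrease from left to right or from top to bottom."""
    rows_ok = all(row[i] <= row[i + 1] for row in grid for i in range(len(row) - 1))
    cols_ok = all(
        grid[r][c] <= grid[r + 1][c]
        for r in range(len(grid) - 1)
        for c in range(len(grid[0]))
    )
    return rows_ok and cols_ok


print(transpose(GY) == YG, transpose(BG) == GB)
print(all(is_monotone(grid) for grid in (GY, YG, BG, GB)))
```

Both checks pass for the data recorded here, so the GY/YG and BG/GB pairs describe the same overlaps from opposite directions.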

Results

Web search

Overlap of Search Engines

This table provides the class data for the Overlap of Search Engines:

Statistic   Live     Google   Yahoo    L/G      L/Y      G/Y      L/G/Y
Mean        42.8%    54.4%    51.7%    18.3%    20.0%    20.6%    10.0%
Median      42.5%    57.5%    52.5%    20.0%    20.0%    20.0%    10.0%
Mode        15%      70%      70%      10%      10%      25%      10%
Std. Dev.   22.8%    20.1%    22.4%    9.5%     11.4%    7.8%     7.5%

The columns with one search engine in the heading provide the precision statistics. The columns with two or three provide statistics about the overlap.
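
For readers who want to reproduce this kind of summary, here is a minimal sketch using Python's standard `statistics` module. It assumes each class member contributes one precision value per engine and that the standard deviation is the sample version (`stdev`); the page does not say whether the sample or population formula was used. The values in `sample` are made up for illustration and are not the class data.

```python
import statistics


def summarize(values):
    """Return the four summary statistics reported in the class tables."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Sample standard deviation; use statistics.pstdev for the population version
        "std_dev": statistics.stdev(values),
    }


# Hypothetical precision percentages for one engine, one value per student
sample = [15, 40, 45, 70, 70]
summary = summarize(sample)
for name, value in summary.items():
    print(name, round(value, 1))
```

Running `summarize` once per column would yield one column of the table above.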


These tables provide the class data for the Overlap of Rankings:

GY
Statistic   o(5,5)     o(10,5)    o(20,5)    o(5,10)    o(10,10)   o(20,10)   o(5,20)    o(10,20)   o(20,20)
Mean        1.1        1.4        1.6        1.3        2.0        2.6        1.6        2.5        3.7
Median      1.0        1.0        2.0        1.0        2.0        3.0        1.0        3.0        4.0
Mode        1.0        0.0        0.0        1.0        1.0        4.0        1.0        3.0        5.0
Std. Dev.   1.2        1.3        1.4        1.2        1.3        1.7        1.2        1.5        2.1
YG
Statistic   o(5,5)     o(10,5)    o(20,5)    o(5,10)    o(10,10)   o(20,10)   o(5,20)    o(10,20)   o(20,20)
Mean        1.1        1.2        1.6        1.5        1.9        2.5        1.9        2.6        3.8
Median      1.0        1.0        1.0        1.0        2.0        3.0        2.0        3.0        4.0
Mode        1.0        0.0        1.0        1.0        3.0        3.0        1.0        4.0        5.0
Std. Dev.   1.2        1.3        1.4        1.2        1.4        1.6        1.3        1.7        2.1

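o(a,b) is the number of results shared by the top a results of the left (row) search engine and the top b results of the top (column) search engine. In the tables labeled GY, for example, Google is the left engine and Yahoo is the top engine.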
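
The o(a,b) measure is straightforward to compute from two ranked result lists. The sketch below assumes each list holds result URLs in rank order and that two results match when their URLs are identical. The function names `rank_overlap` and `overlap_grid` and the sample lists are hypothetical.

```python
def rank_overlap(left, top, a, b):
    """o(a,b): count results found in both the first a of `left` and the first b of `top`."""
    return len(set(left[:a]) & set(top[:b]))


def overlap_grid(left, top, cutoffs=(5, 10, 20)):
    """Build the table of o(a,b) values, with rows indexed by a and columns by b."""
    return [[rank_overlap(left, top, a, b) for b in cutoffs] for a in cutoffs]


# Made-up ranked lists, short enough to check by hand
left_results = ["a", "b", "c"]
top_results = ["c", "a", "x"]
for row in overlap_grid(left_results, top_results, cutoffs=(1, 2, 3)):
    print(row)
```

Exact URL matching is the simplest choice. Treating near-duplicate URLs (for example with and without a trailing slash) as the same result would need a normalization step that is not shown here.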

Blog search

This table provides the class data for the Overlap of Blogs:

Statistic   Techn    Gblog    BlogL    T/G      T/B      G/B      T/G/B
Mean        33.1%    52.5%    44.4%    3.6%     9.2%     6.9%     1.4%
Median      30.0%    42.5%    47.5%    0.0%     7.5%     5.0%     0.0%
Mode        35%      40%      50%      0%       5%       5%       0%
Std. Dev.   21.2%    22.2%    14.3%    7.0%     7.7%     6.4%     3.3%

The columns with one blog search engine in the heading provide the precision statistics. The columns with two or three provide statistics about the overlap.


These tables provide the class data for the Overlap of Rankings:

GB
Statistic   o(5,5)     o(10,5)    o(20,5)    o(5,10)    o(10,10)   o(20,10)   o(5,20)    o(10,20)   o(20,20)
Mean        0.3        0.4        0.5        0.4        0.5        0.8        0.7        0.8        1.1
Median      0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        1.0
Mode        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0
Std. Dev.   0.5        0.6        0.6        0.6        0.7        1.0        0.9        1.1        1.2
BG
Statistic   o(5,5)     o(10,5)    o(20,5)    o(5,10)    o(10,10)   o(20,10)   o(5,20)    o(10,20)   o(20,20)
Mean        0.3        0.4        0.6        0.4        0.5        0.8        0.5        0.9        1.1
Median      0.0        0.0        0.0        0.0        0.0        1.0        0.0        1.0        1.0
Mode        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        1.0
Std. Dev.   0.5        0.6        0.9        0.6        0.7        1.1        0.6        1.0        1.2

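o(a,b) is the number of results shared by the top a results of the left (row) search engine and the top b results of the top (column) search engine. In the tables labeled GB, for example, Google Blog Search is the left engine and Bloglines is the top engine.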

Discussion

Web search

The class statistics allow us to rank the search engines by precision. Although all three are fairly close, both the mean and the median place Google first, followed by Yahoo, with Live last. This ranking is based only on our class results and should not be taken as conclusive. The mean matters here because it gives the average precision across the class's searches, while the median, being the middle value, limits the effect of outliers. The mean precisions, in ranking order, are 54.4% (Google), 51.7% (Yahoo), and 42.8% (Live). Google and Yahoo are close in precision, but Live is noticeably lower, trailing by roughly 9 to 12 percentage points. I find it interesting that this is the same order as their market share.

The overlap data show that Google and Yahoo return largely different results. For instance, the mean overlap between the top 5 results of one engine and the top 5 results of the other is 1.1 in both directions, so on average only about one of Yahoo's top 5 results also appears in Google's top 5. Unfortunately, we do not know whether that shared result tends to be relevant. Still, with so little overlap and mean precisions of 54.4% and 51.7%, there is clearly much to gain from searching on both Google and Yahoo rather than just one.
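
The gain from using both engines can be made concrete with a little arithmetic: the number of distinct results seen across two lists is the sum of their lengths minus the overlap. The sketch below plugs in the class mean overlaps from the GY table. Treating a class mean as the overlap a typical searcher would see is an assumption, and the function name `distinct_results` is illustrative. The calculation says nothing about how many of those results are relevant.

```python
def distinct_results(a, b, overlap):
    """Distinct results across the top a of one engine and the top b of the other."""
    return a + b - overlap


# Class mean overlaps o(5,5), o(10,10), o(20,20) from the GY table
mean_overlap = {5: 1.1, 10: 2.0, 20: 3.7}

for depth, overlap in mean_overlap.items():
    total = distinct_results(depth, depth, overlap)
    print(f"top {depth} on both engines: about {total:.1f} distinct results")
```

For the top 5, this gives about 8.9 distinct results, compared with 5 from a single engine.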

The best recommendation for the average searcher, who typically looks at the first 5 results or fewer, is to use both Google and Yahoo. The mode of o(5,5) indicates that most often only one of the top 5 results appears on both engines. By looking at both, a searcher will nearly double the number of distinct sources found. I had previously assumed that most search engines would provide approximately the same sources in a similar order. I had not known how much search engines differ in the results they return. There is room for further investigation that takes into account the relevance of the overlapping sources. This would indicate whether the relevant sources tend to appear on multiple search engines, or whether both search engines simply have similar flaws.

Blog search

The blog statistics show a wider spread in precision among blog search engines than among web search engines. Ranked by mean, Google Blog is first, followed by Bloglines, with Technorati last; the means are 52.5%, 44.4%, and 33.1%, respectively. Ranked by median, which as mentioned above is less affected by outliers, Bloglines (47.5%) is the most precise, with Google Blog (42.5%) second and Technorati (30.0%) third. The standard deviations are high, indicating that the data are widely spread, so the median is probably the better statistic to look at in this case.
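
The way the ranking flips between the two statistics can be checked directly from the class table. This is a small sketch using the mean and median precision values reported above; the dictionary layout and the function name `rank_by` are illustrative only.

```python
# Mean and median precision (percent) from the class blog search table
blog_precision = {
    "Technorati": {"mean": 33.1, "median": 30.0},
    "Google Blog": {"mean": 52.5, "median": 42.5},
    "Bloglines": {"mean": 44.4, "median": 47.5},
}


def rank_by(stat):
    """Return engine names ordered from most to least precise on the chosen statistic."""
    return sorted(blog_precision, key=lambda name: blog_precision[name][stat], reverse=True)


print("by mean:  ", rank_by("mean"))
print("by median:", rank_by("median"))
```

Google Blog and Bloglines trade places depending on the statistic, while Technorati is last either way.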

The overlap statistics show very little overlap between the results of Google Blog and Bloglines. As the modes indicate, there is most often no overlap at all. This suggests that blog search engines either rank very differently from each other or draw on different pools of blogs. With so few overlaps, it is not worthwhile to look more closely at the top 5 or 10 results, because there is not enough data to draw any meaningful conclusions. My recommendation for someone searching blogs is to use multiple blog search engines and to start with Bloglines. As with web search engines, I learned how different blog search engines are from each other. My recommendation for further analysis is the same as the one given above for web search engines.
