Oct
28th

About the I-Spy Meta Search Engine

Files under WWW | Leave a Comment | 28 Saturday October 2006

Submitted to ACM Computing Reviews October 2006


A Review of the Paper:

Smyth, B. and Balfe, E. 2006. Anonymous personalization in collaborative web search. Inf. Retr. 9, 2 (Mar. 2006), 165-190. DOI= http://dx.doi.org/10.1007/s10791-006-7148-z

I remember that around 1998 meta-search engines such as MetaCrawler were very popular among students and the research community, until the arrival of Google revolutionized the concept of searching the WWW. Meta-search engines have since lost most of their popularity to the dominance of the Google, Yahoo, AlltheWeb and Windows Live search engines. Not much has happened in the search engine arena from a technological perspective since Google’s PageRank algorithm either. Some interesting alternatives have emerged, such as the Lexxe search engine, based upon advanced natural language processing; domain-specific search engines such as the Kosmix Health search engine; Clusty, the clustering search engine; and BrainBoost, a somewhat intelligent question-answering search engine.

This paper describes in detail an innovative proof-of-concept meta-search engine called “I-Spy”. We may call it somewhat revolutionary in the sense that it is based upon hit-matrices, that is, query (qi) x document (dj) matrices storing the number of hits in each cell (Hi,j), which remind us of the typical term x document matrices used in information retrieval algorithms. Hit-matrices represent the search histories of previous community users. I-Spy uses Case-Based Reasoning (CBR) technology for displaying search results against a given query. In this case, the basic philosophy of CBR is the reuse of successful previous searches to solve future queries that are sufficiently similar. For each submitted query, I-Spy retrieves a preset maximum number of results per search engine and recombines them into a list ordered by normalized overall scores (Rm).

Next, I-Spy re-ranks this ordered list based upon the selection history of previous searches (Rm is converted into RT). Results that are relevant to the current query are promoted (re-ranked with a higher score) in the list. Relevant previous results that are in the hit matrix (Hi,j) but not yet in RT are also added to RT and promoted.
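The promotion step can be sketched as follows. This is a minimal illustration, not the I-Spy implementation: the hit matrix is a plain dictionary, the names `relevance` and `rerank` are hypothetical, and relevance is assumed to be the share of a query's past selections that went to a page (Hi,j divided by the row total). Matching the query exactly is also a simplifying assumption, since the paper reuses similar queries as well.

```python
from typing import Dict, List

# Hit matrix for one community: query -> {page: number of past selections}.
HitMatrix = Dict[str, Dict[str, int]]


def relevance(hits: HitMatrix, query: str, page: str) -> float:
    """Share of past selections for `query` that went to `page` (assumed formula)."""
    row = hits.get(query, {})
    total = sum(row.values())
    return row.get(page, 0) / total if total else 0.0


def rerank(meta_results: List[str], hits: HitMatrix, query: str) -> List[str]:
    """Turn the meta-search list Rm into the final list RT."""
    row = hits.get(query, {})
    # Pages with at least one recorded hit, including pages missing from Rm.
    promoted = sorted(
        (page for page, count in row.items() if count > 0),
        key=lambda page: relevance(hits, query, page),
        reverse=True,
    )
    promoted_set = set(promoted)
    # Remaining pages keep their original meta-search order.
    rest = [page for page in meta_results if page not in promoted_set]
    return promoted + rest


# Made-up example data for one community.
hits = {"jaguar": {"cars.example/jaguar": 6, "zoo.example/jaguar": 2}}
rm = ["wiki.example/jaguar", "zoo.example/jaguar", "os.example/jaguar"]
rt = rerank(rm, hits, "jaguar")
print(rt)
```

Here the page that the community selected most often is placed first even though the meta-search list did not contain it, and pages without any recorded hits follow in their original order.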

I-Spy maintains separate hit-matrices for separate communities, limiting their growth as compared to term x document matrices. We can intuitively foresee matrices with a probably large number of rows (queries) but a limited number of columns (documents), given that users normally select only the most relevant results (documents) for a query (e.g. users select documents among the first 25 results).

We might criticize their implementation for not taking advantage of parallel processing when users submit queries. While waiting for results from the search engines, the system could determine a priori the top-k most relevant communities, enabling naïve users to submit generic queries to all communities and letting them choose which community is appropriate for their interests (i.e. community discovery).

We conclude this review by stating that European universities are not really competent at transferring technology to the industry sector, losing clear market opportunities, as compared to American universities. The European Community should promote not only Networks of Excellence (NoEs), Specific Targeted Research Projects (STREPs) or Integrated Projects (IPs) but also smaller innovative entrepreneurial incubator projects. We believe there is a huge pool of proof-of-concept projects that can be transferred from academia to the entrepreneurial sector, potentially generating alternative funds for universities in the form of royalties, licenses and company shares.

Companies such as Eurekster are trying to fill the niche market of community-based search engines. Their beta product, Swicki, learns the search behaviour of communities by letting community users promote or exclude web sites and pages, and by focusing search on domain-specific sub-webs.

 

NOTE: Formula (5) on page 174 has a small erratum: Relevance(pj,pi) should be Relevance(pj,qT).

Posted in WWW | Posted by VirgoBrain





Oct
20th

Speculating about the Semantic Web: Where is the Data?

Files under Semantic Web, WWW | Leave a Comment | 20 Friday October 2006
Where Is the Data: Current Estimates
Having characterised the current situation of the WWW, we are interested in knowing how much information, and particularly Semantic Web information, is available on the WWW as of October 2006. Since the publication of one of the very first works by Tim Berners-Lee explaining the Semantic Web [20] (circa 2001), not much has been achieved on the WWW. As of the end of 2006 the Web is still characterised by the predominance of HTML and its more advanced dialects such as XHTML.

Sampling Google: the WWW versus the Semantic Web
We might be able to arrive at some simple conclusions by sampling the Web, querying one of the most popular and biggest search engines available on the WWW. As of September 2005 it was claimed that Google held roughly 24 billion entries in its index [21], returning 65% more results for a random query than its nearest rival, which was Yahoo at the time [22]. We believe these figures might be distorted and consider the size estimate proposed by [15] more realistic, as indicated in (9):

WWW size Jan 2005 (11.5 billion pages) x 1.14 increase = 13.11 billion pages as of Jan 2006 (9) (as stated in the previous post, Speculating About Internet/WWW Demographics and Size)

Sampling HTML resources
It seems that the Google search engine does not allow a plain wildcard expression such as the following for finding the number of indexed resources of a given file type (as far as the Google search help pages indicate):

filetype:

A way to overcome this limitation is to use a negative query, that is, a query that retrieves the entries that do not contain a given string. By entering an unlikely string combination we are able to simulate the wildcard query. The following queries give us a rough estimate of the number of RDF entries as compared to plain HTML. Google, as of 10th October 2006, 06:15 GMT+01 (Madrid), gives us the following responses:

Sampling HTML Data

Q1: html filetype:html returns 7,510,000,000 entries, or 7.51 billion entries. We might consider this query an upper-bound query. We are following the same procedure as in [23].

Q2: Negative query, giving the number of HTML files that do not contain the “impossible” string combination: -ñ^^ñ^^ñ^^ filetype:html returns 2,500,000,000 entries, or 2.5 billion entries.

Please note that we believe the Google search engine does not interpret "^" here as raising to a power. We take this query as a plausible approximation of the number of HTML entries in the index.

Sampling RDF triples

Q3: filetype: returns 35,600,000 entries or 35.6 million entries

We believe that in this case Google returns a very optimistic upper bound, which was the figure taken into account in [23]. We conclude that the estimates returned by Google when using the filetype operator are not consistent. Therefore, given the lack of information in Google's web pages [25] and the lack of a consistent interpretation, we have to acknowledge that these are merely speculative estimates.

Q4: rdf:RDF filetype: returns only 34,300 entries, or 34.3 k entries.

This query represents the number of entries with qualified namespace declarations. We may consider this query a very low lower-bound query.

Q5: Negative query, giving the number of RDF files that do not contain the "impossible" string combination (i.e. the qualified name reversed): -drf:DRF filetype: returns 1,780,000 entries, or 1.78 million entries.

We take this query as the most realistic scenario for RDF file-based repositories on the WWW.

As we can see, the RDF files represent a minimal 0.0712% (1.78 million / 2.5 billion) of the HTML entries. Given that Google covers around 76.2% of the total size of the Web [15], and given our previous estimate of the size of the Web according to world population connectivity as of January 2006 (9):

WWW size Jan 2005 (11.5 billion pages) x 1.14 increase = 13.11 billion pages as of Jan 2006 (as stated in the previous post, Speculating About Internet/WWW Demographics and Size)

The possible index size of Google as of January 2006 would be: 13.11 billion pages x 0.762 = 9.98982 billion pages ≈ 10 billion pages (10)

The percentage of Semantic Web data, represented mainly by RDF files, in Google would possibly be the following:

1,780,000 entries / 9.98982 billion Google entries = 0.0178% for the most realistic scenario (11)

35,600,000 entries / 9.98982 billion Google entries = 0.356% for the most optimistic scenario (12)
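The arithmetic behind estimates (9) to (12) can be rechecked with a short script. It is only a sketch that reuses the figures quoted in this post; the variable names are illustrative, and the 11.5 billion starting point is taken from the title of [15].

```python
# Figures quoted in the post; nothing here is measured independently.
web_size_jan_2005 = 11.5e9     # pages, from [15]
growth_factor = 1.14           # yearly increase used in (9)
google_coverage = 0.762        # share of the Web covered by Google, from [15]

web_size_jan_2006 = web_size_jan_2005 * growth_factor     # (9)
google_index = web_size_jan_2006 * google_coverage        # (10)

realistic_hits = 1_780_000     # result count of Q5
optimistic_hits = 35_600_000   # result count of Q3
html_hits = 2_500_000_000      # result count of Q2

share_realistic = 100 * realistic_hits / google_index     # (11), in percent
share_optimistic = 100 * optimistic_hits / google_index   # (12), in percent
share_of_html = 100 * realistic_hits / html_hits          # Q5 relative to Q2

print(f"Web size Jan 2006: {web_size_jan_2006 / 1e9:.2f} billion pages")
print(f"Google index:      {google_index / 1e9:.5f} billion pages")
print(f"Realistic share:   {share_realistic:.4f} %")
print(f"Optimistic share:  {share_optimistic:.3f} %")
print(f"Q5 versus Q2:      {share_of_html:.4f} %")
```

Changing the coverage or growth assumptions at the top shows how sensitive the final percentages are to those two inputs.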

We acknowledge that all these estimates are somewhat speculative. Nevertheless, we can discern that the current presence of the Semantic Web on the WWW is minimal or almost non-existent compared to HTML and other sources.

For instance the following query:

Q6: filetype: returns 480,000,000 entries or 480 million

Even when considering both upper-bound queries, Q3 and Q6, the RDF data counted by Q3 represents only about 7.4% (35.6 million / 480 million) of the data counted by Q6. In other words, as of October 2006, there might be around 13.5 times more of the latter on the WWW than RDF data.
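The two figures in this comparison follow directly from the result counts of Q3 and Q6, as the following sketch shows (the variable names are illustrative):

```python
q3_hits = 35_600_000    # upper-bound result count of Q3
q6_hits = 480_000_000   # upper-bound result count of Q6

fraction_percent = 100 * q3_hits / q6_hits   # Q3 as a percentage of Q6
times_more = q6_hits / q3_hits               # how many times larger Q6 is

print(f"Q3 is {fraction_percent:.1f} % of Q6")
print(f"Q6 is {times_more:.1f} times larger than Q3")
```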

References
 [15] Gulli, A. and Signorini, A. 2005, "The indexable web is more than 11.5 billion pages", In Special interest Tracks and Posters of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW ‘05. ACM Press, New York, NY, 902-903. DOI= http://doi.acm.org/10.1145/1062745.1062789

 [20] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web", Scientific American, vol. 284, no. 5, 2001, pp. 34-43. Available at: http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21

 [21] Tristan Louis Blog, "Google has 24 billion items index, considers MSN search nearest competitor", September 27, 2005 Available at http://tnl.net/blog/2005/09/27/google-has-24-billion-items-index-considers-msn-search-nearest-competitor/

 [22] Matthew Cheney and Mike Perry, "A Comparison of the Size of the Yahoo! and Google Indices", 13 October 2006 Available at http://vburton.ncsa.uiuc.edu/indexsize.html

 [23] University of Maryland Baltimore County (UMBC) eBiquity Group Blog, "How many Semantic Web documents are on the Web?", 13 October 2006. Available at http://ebiquity.umbc.edu/blogger/how-many-semantic-web-documents-are-on-the-web/

 [25] Google Guide Web Site, "Using Search Operators (Advanced Operators)", 13 October 2006 Available at http://www.googleguide.com/advanced_operators.html




Oct
18th

Mare Magnum or finding a needle in a haystack

Files under WWW | Leave a Comment | 18 Wednesday October 2006
According to Technorati, as of July 2005 there were over 14.2 million weblogs. They also stated that the number of blogs doubles every 5 months. By applying simple maths we can roughly calculate that as of October 2006 there might hypothetically be around (2 to the power of (15 months / 5 months)) * 14.2 million = 113.6 million weblogs. Somebody else has forecast that the number of blogs worldwide will exceed 150 million.
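The doubling estimate can be written as a small function. This is only a sketch of the arithmetic above; the function name and its parameters are illustrative, and steady doubling every 5 months is the assumption quoted from Technorati.

```python
def projected_blogs(initial_millions: float, months_elapsed: float,
                    doubling_months: float = 5.0) -> float:
    """Project a count that doubles every `doubling_months` months."""
    return initial_millions * 2 ** (months_elapsed / doubling_months)


# July 2005 to October 2006 is 15 months, i.e. three doubling periods.
estimate = projected_blogs(14.2, 15)
print(f"{estimate:.1f} million weblogs")
```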
So what is the probability that somebody else will read our weblog? I suppose this will depend on your page rank, for instance, how trendy your weblog is, the social networks that you belong to and so on. I assume this could be formalised by some fantastic equation taking into account some of these premises:
 
"We live in a Mare Magnum where people are trying to find a needle in a haystack!" 

To be honest, the purpose of this weblog is just to record the thoughts of the day, the world news that influenced my own consciousness, and so on. I will re-read some time in the future what I recorded in the past. We have to acknowledge that we are complex and evolving living creatures. In other words, I am not the same human being I was 10 years ago, as the environment and my own circumstances have influenced my consciousness, my software.

 

Cheers, Carlos



Oct
14th

Welcome to VirgoBrain’s Blog

Files under Personal | 4 Comments | 14 Saturday October 2006


This is my very first posting to VirgoBrain’s Blog.

I welcome you all and invite you virtually to my wedding next July 2007. Carmen and I will be marrying in Peru, where she comes from.

We invite you to leave a message for our upcoming wedding.
