<body>

Sunday, November 12, 2006

Thoughts on Why an India-specific Search Engine in English is a Misnomer

Why India-specific content in English is a misnomer, or why search engines that position themselves as India-specific should ideally index the entire Web for English-language content.

For regional languages, positioning a search engine as India-specific is obvious and fine. For English-language search, however, any search engine that wants to provide good results needs to index the entire Web and not only India-specific content. India-specific content per se doesn't exist, because the Indian Internet user is looking for content that is global in nature.

What is India-specific content and how does one find out?

What if a collegian from a small town in India is searching for a University in the USA, and what if an Indian software professional from San Francisco is searching for devotional folk songs from India?

A search engine that positions itself as India-specific needs to address both these audiences. If India-specific content is collected only by IP filtering, domain filtering, and gathering content from DMOZ India and other directories, then the content that comes up, say 150 million Web pages, is not enough, because the Indian searcher is looking for global information. A search engine that falls into the trap of 'India-specific' content tends to skip indexing the rest of the English-language Web, and misses out on information that people from India are searching for.

Let's take an example. If I search for Cricket on Guruji.com, the first result I get is cricket.deepthi.com, which addresses the query but isn't the best result. The best result, in my opinion, is a more popular Web site like Cricinfo, or Cricinfo India in an Indian context; but because of domain filtering, IP filtering, and other filters that may have been set, Guruji's crawler GurujiBot would not have indexed those sites. This is where a search engine that positions itself as India-specific misses out.
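To make the filtering problem concrete, here is a minimal sketch of what a domain-based crawl filter might look like. Everything in it is an assumption for illustration: I don't know how GurujiBot actually decides what to crawl, the function name `passes_india_filter` is made up, and the idea that cricket.deepthi.com got in through a directory listing is only a guess. IP filtering is left out because it needs a live DNS lookup.

```python
from urllib.parse import urlparse

# Assumption: the crawler keeps a URL only if its host ends in an Indian
# suffix or appears on a directory-derived allow-list. Both the rule and
# the hosts below are hypothetical, not Guruji's real configuration.
INDIAN_SUFFIXES = (".in",)  # also matches .co.in, .gov.in, .ac.in, .nic.in
DIRECTORY_HOSTS = {"cricket.deepthi.com"}  # assumed to come from a directory


def passes_india_filter(url: str) -> bool:
    """Return True if the URL survives the hypothetical India-only filter."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(INDIAN_SUFFIXES) or host in DIRECTORY_HOSTS


candidates = [
    "http://cricket.deepthi.com/",
    "http://www.cricinfo.com/",
    "http://www.example.co.in/cricket",  # placeholder Indian domain
]

# Only the URLs that pass the filter would ever reach the index.
kept = [url for url in candidates if passes_india_filter(url)]
print(kept)
```

Under these assumptions the Cricinfo URL never reaches the index, no matter how good its content is, because the decision is made on the host name alone.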

For English-language searches, an Indian search engine like Guruji.com is competing with giants like Google, Yahoo!, and MSN (Live). Google.co.in provides results from all over the Web, and so do Yahoo! and Live. If those major search engines provide results that are more relevant to search queries, people will use them. Why would they use Guruji.com for English search if it can't provide English-language results on par with those of the global search engines?

Another example: search for Angelina Jolie on Guruji. It isn't an India-specific keyword, but Indians are searching for Angelina Jolie, and a search engine that doesn't index Web sites related to Hollywood because they aren't India-specific does so at its own peril.

- Ideally, an Indian search engine should index the entire Web for English-language content. Filters may be set to skip crawling other languages such as German, French, Spanish, and Italian. But even for India, the English-language content should be that of the entire Web, not content narrowed down by domain or IP filters.

- Indians are searching for news, travel, sports, celebrities, religion, education, entertainment, shopping, movies, music, social networking, matrimony sites, health, reference, jobs, adult content, and more. This is the trend almost globally, as seen from Google Zeitgeist.

- What is the profile of an average Indian Internet user?
Given a choice between English and regional languages, most people who know both would still rather type, write, chat, surf, search, browse, and shop online in English, because it is the preferred medium of communication for the average Indian Internet user.

- How many people have access to the Internet in India?
As per September 2006 numbers from the Internet and Mobile Association of India (IAMAI), 37 million people access the Internet in India; that is less than 3.5 per cent of the population. Of these, 39 per cent access the Internet from cyber cafes, 31 per cent from home, and 22 per cent at work. 32 per cent of these 37 million users use the Internet as their primary source for information and research.

Of these 12 million people (~32 per cent of 37 million), how many would be using a regional language as their primary medium of accessing the Internet? Maybe 0.1 per cent or even less. So the potential of regional-language search engines as a revenue-generating medium arises when, say, 10 per cent of India's population (~120 million) has access to the Internet. On an optimistic note that should happen in the next five years, although there aren't any firm numbers to say whether it would take five years, ten, or more for Indian regional search to be a viable stand-alone revenue-generating model for search engines operating in India.
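The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. This is only a sketch: the 0.1 per cent figure is my rough guess rather than a measurement, and the population value is the one implied by "10 per cent is about 120 million", not a number from IAMAI.

```python
# Figures quoted above (IAMAI, September 2006).
internet_users = 37_000_000
primary_source_pct = 32        # use the Internet as their primary source
target_penetration_pct = 10    # threshold suggested in the post
target_users = 120_000_000     # "~120 million"

# Assumption: the rough guess of 0.1 per cent, i.e. 1 user in 1,000.
regional_per_thousand = 1

primary_users = internet_users * primary_source_pct // 100
regional_users = primary_users * regional_per_thousand // 1000

# Population implied by "10 per cent ~ 120 million" (derived, not from IAMAI).
implied_population = target_users * 100 // target_penetration_pct
penetration_pct = 100 * internet_users / implied_population
growth_needed = target_users / internet_users

print(primary_users)       # 11840000, rounded to 12 million in the text
print(regional_users)      # 11840
print(implied_population)  # 1200000000
print(round(penetration_pct, 2), round(growth_needed, 2))
```

In other words, at a 0.1 per cent share the regional-language audience would be on the order of twelve thousand people, current penetration works out to a little over 3 per cent, and the user base would need to grow to more than three times its present size to reach the 120 million mark.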

- So for an Indian search engine to survive in the short run, it needs to deliver English search results that are as good as those provided by Google, Yahoo!, Live, and Ask.

All Indian regional-language content combined would make up less than 10 per cent of the English-language content taken only from Indian Web sites (*.in, *.co.in, *.gov.in, *.ac.in, *.nic.in, and *.com by IP filtering), leaving aside global English content.

- There may be infrastructure, hardware, Internet connectivity, programming resource, and monetary limitations on crawling the entire Web. But not crawling the Web because of a self-imposed limit to include only India-specific content in the index would be a fallacy: Indian content is not only related to India, Indian Web sites, or sites with an Indian domain; it is global in nature.

- That means even a search engine that positions itself as an Indian search engine should have global content in its English-language index: at least 10 billion Web pages. This approximate number comes from checking the number of results for a common English word such as 'and', 'or', 'the', 'for', or 'but', which is likely to be found on any English-language Web page.
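The common-word estimate can be sketched as follows. The reasoning is that every page counted for a word is in the index, so each hit count is a lower bound on the index size and the largest count is the tightest one. The function names are hypothetical, the counts in the usage example are toy numbers and not measurements, and the result counts that search engines report are themselves only approximations.

```python
from typing import Mapping


def estimate_index_size(hit_counts: Mapping[str, int]) -> int:
    """Lower-bound estimate of an English index from common-word hit counts.

    Each count is the number of results an engine reports for one very
    common word; the largest count is the best available lower bound.
    """
    if not hit_counts:
        raise ValueError("need at least one hit count")
    return max(hit_counts.values())


def meets_target(hit_counts: Mapping[str, int], target: int = 10_000_000_000) -> bool:
    """Check the estimate against the 10 billion page figure from the post."""
    return estimate_index_size(hit_counts) >= target


# Toy numbers only, to show the mechanics; replace with counts read off
# a real results page.
toy_counts = {"the": 900, "and": 850, "for": 700}
print(estimate_index_size(toy_counts))        # 900
print(meets_target(toy_counts, target=1000))  # False
```

Taking the maximum rather than the sum matters here: a page that contains 'the' almost certainly contains 'and' too, so adding the counts would count the same pages several times.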

- On a regional scale, if an Indian search engine finds more content than, say, only the Unicode (UTF-8) pages indexed by Google, it has a chance of upstaging the giant. But that's only for local languages, where the PPC advertising market is not as attractive as in English.

- For a search in English, if a user searches on Guruji and doesn't get the result, all that stands between that user and the home page of Google or Yahoo! is an Alt+F4.