Fixing the search at DLI (Digital Library of India)

Posted: July 28, 2010 in books, india, rants, telugu, work

Recently I came to know about DLI through this post on Sowmya’s blog. DLI is an admirable effort to digitize old books and make them accessible online. They have a nice collection of rare old books in many Indian languages.

In her post, Sowmya explains how hard it is to search for books on that site. DLI spells the names of Indian authors in many different non-standard ways. If that’s not ludicrous enough, the same name is spelled differently from one title to the next. For example, Lakshmi appears as ‘Laxmi’ in one book and as ‘laq-smi’ in another. Well, you can imagine how tedious it would be to search for something in such a database.

It got me thinking about whether there is a simple way to fix this search problem. It occurred to me that a SoundEx-based search can solve it: SoundEx reduces a name to a short code based on how it sounds, so different spellings of the same name usually end up with the same code. It’s such a simple thing to implement if they want to. Pretty much all major database systems have a built-in SoundEx function these days.

For a moment, I considered implementing it myself by crawling the website and building an index. It’s doable because we don’t have to replicate the entire DLI database. All we need to do is store all the unique names that appear, with their different spellings; that can’t be very big. We can use this database to retrieve the list of alternate spellings that are used for any given name. We submit all those spellings to the DLI search, and bingo! We have the results that we are looking for. For example, if someone searches for ‘Lakshmi’, a SoundEx lookup in our database would give us the other spellings ‘Laxmi’ and ‘Laq-smi’. We submit three queries to DLI with these three different spellings.
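To make the idea concrete, here is a rough Python sketch of the lookup side. It is only an illustration under a few assumptions: it uses the common American Soundex rules (a database's built-in function may differ in small details), the function names `soundex`, `build_index` and `spellings_for` are made up for this post, and the list of names is a tiny hand-typed sample rather than anything crawled from DLI. Actually submitting the queries to the DLI search is left out.

```python
from collections import defaultdict

# American Soundex letter-to-digit table (assumed variant).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code for a name ('' if no letters)."""
    # Keep only A-Z, so hyphens, spaces and digits are ignored.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat after a vowel counts
    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


def build_index(names):
    """Group every known spelling under its Soundex code."""
    index = defaultdict(set)
    for name in names:
        index[soundex(name)].add(name)
    return index


def spellings_for(query, index):
    """Return the query plus every indexed spelling that sounds like it."""
    variants = set(index.get(soundex(query), ()))
    variants.add(query)
    return sorted(variants)


# Hypothetical sample of author-name spellings, standing in for a crawl.
sample_names = ["Laxmi", "Laq-smi", "Krishna"]
index = build_index(sample_names)
for spelling in spellings_for("Lakshmi", index):
    print(spelling)  # each of these would become one DLI search query
```

All three spellings of Lakshmi reduce to the same code here, so one lookup returns the full set of queries to send. Soundex is coarse, though, so on a real list of names it would also pull in some unrelated names that happen to share a code.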
Anyway, I dropped the idea for now since I don’t have a place at my disposal to host such a system. Besides, it doesn’t seem like there are enough people using DLI to warrant that effort.

While we are on the topic, a few more things came to mind when I looked at DLI. The website proudly proclaims that “For the first time in history, the Digital Library of India is digitizing all the significant works of Mankind”. Tall claims without substance, typical of us Indians. The entire website looks very crudely done and could use some improvement. I don’t understand why they are digitizing English works published in other countries. There are efforts like Project Gutenberg that are doing a finer job of that.

  1. jai says:

    Some form of probabilistic matching can address this issue too.
