I want to build a riddle chatbot for my AI class. I figured the input to the chatbot would be something like:
"It is blue, and it is up, but it is not the ceiling"
<Object X> <blue> <up> <!ceiling> </Object X>
(Answer: sky?)
So the input is a set of characteristics (present or absent in the object), and the output is the most likely matching object.
The domain will be limited to a number of objects. I could input all the attributes myself, but I was thinking:
How could I programmatically build a database of characteristics for a word? Is there such a database available? How could I tag a word, and how could I programmatically find all its attributes? I was thinking of crawling Wikipedia or some forum, but I can't see that producing a reliable word-tag database.
Any ideas on how I could achieve this? Any pointers to literature on the subject?
This sounds like a basic classification problem. You're essentially asking: given N features (color=blue, location=up, etc.), which of M classes is the most likely? There are many algorithms for accomplishing this (Naive Bayes, Maximum Entropy, Support Vector Machines), but you'll have to investigate which is the most accurate and easiest to implement. The biggest challenge is typically acquiring accurate training data, but if you're willing to restrict it to a list of manually entered examples, that should simplify your implementation.
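To make the classification framing concrete, here is a minimal Naive Bayes sketch in plain Python. The training list, the attribute names, and the `excluded` parameter (used for the "not the ceiling" clue) are all made up for illustration; with only one hand-entered example per object, add-one smoothing is doing most of the work, so treat this as a starting point rather than a tuned classifier.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-entered training data: (object, attributes it has).
TRAINING = [
    ("sky", {"blue", "up"}),
    ("ceiling", {"up", "white"}),
    ("sea", {"blue", "wet"}),
]

def train(examples):
    """Count how often each object and each (object, attribute) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for label, features in examples:
        label_counts[label] += 1
        for feature in features:
            feature_counts[label][feature] += 1
    return label_counts, feature_counts

def classify(query, label_counts, feature_counts, excluded=()):
    """Return (best_label, scores).

    query maps attribute -> True (object has it) or False (object lacks it).
    Attributes missing from the query are simply ignored.
    """
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        if label in excluded:
            continue  # e.g. "it is not the ceiling"
        score = math.log(n / total)  # log prior
        for feature, wanted in query.items():
            # Add-one smoothed estimate of P(object has feature | label).
            p_has = (feature_counts[label][feature] + 1) / (n + 2)
            score += math.log(p_has if wanted else 1 - p_has)
        scores[label] = score
    return max(scores, key=scores.get), scores

label_counts, feature_counts = train(TRAINING)
best, scores = classify({"blue": True, "up": True},
                        label_counts, feature_counts,
                        excluded={"ceiling"})
print(best)
```

Working in log space avoids multiplying many small probabilities, and the smoothing keeps an attribute that was never seen for an object from zeroing out its score.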
Your example suggests that whatever algorithm you choose will have to support sparse data. In other words, if you've trained the system on 300 features, it won't require you to enter all 300 features in order to get an answer. It'll also make your training and testing files smaller, because you'll be able to omit features that are irrelevant for certain objects. For example:
sky | color:blue,location:up
tree | has_bark:true,has_leaves:true,is_an_organism:true
cat | has_fur:true,eats_mice:true,is_an_animal:true,is_an_organism:true
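A sparse file like this is easy to load and query. The sketch below assumes one object per line in a `name | key:value,key:value` layout with `:` as the only key/value separator; the function names and the simple "count the matching features" score are my own choices for illustration, not part of any particular library.

```python
SPARSE_DATA = """\
sky | color:blue,location:up
tree | has_bark:true,has_leaves:true,is_an_organism:true
cat | has_fur:true,eats_mice:true,is_an_animal:true,is_an_organism:true
"""

def parse_sparse(text):
    """Parse 'name | key:value,...' lines into {name: {key: value}}."""
    objects = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, feature_part = line.partition("|")
        features = {}
        for pair in feature_part.split(","):
            key, _, value = pair.strip().partition(":")
            features[key] = value
        objects[name.strip()] = features
    return objects

def best_match(objects, query):
    """Pick the object that agrees with the most features in the query.

    Features the query does not mention are never looked at, so the
    caller only supplies what the riddle actually states.
    """
    def score(features):
        return sum(1 for key, value in query.items()
                   if features.get(key) == value)
    return max(objects, key=lambda name: score(objects[name]))

objects = parse_sparse(SPARSE_DATA)
print(best_match(objects, {"has_fur": "true", "is_an_organism": "true"}))
```

The query here names only two of the features in the file, which is the point of the sparse representation: neither the data file nor the question has to mention every feature.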
It might not be terribly helpful, since it's proprietary, but a commercial application that's similar to what you're trying to accomplish is the website 20q.net, albeit the system asks the questions instead of the user. It's interesting in that it's trained "online" based on user input.
Wikipedia certainly has a lot of data, but you'll probably find that extracting it for your program is very difficult. Cyc's data is more normalized, but its API has a steep learning curve. Another option is the semantic dictionary project WordNet. It has reasonably intuitive APIs for nearly every programming language, as well as an extensive hypernym/hyponym model for thousands of words (e.g. cat is a type of feline/mammal/animal/organism/thing).
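One way to use a hypernym model is to turn each ancestor into an attribute of the word. The sketch below does not call WordNet at all: `HYPERNYM` is a hand-written toy table that mirrors the simplified cat chain above, and the `is_a_...` feature naming is an assumption of mine. In a real program, the table lookup would be replaced by whatever WordNet binding you end up using.

```python
# Toy stand-in for a WordNet lookup: each word maps to its direct hypernym.
# The chain is the simplified one from the example, not real WordNet output.
HYPERNYM = {
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "thing",
}

def hypernym_chain(word):
    """Follow hypernym links upward and return every ancestor in order."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def hypernym_features(word):
    """Turn each ancestor into a sparse 'is_a_<ancestor>:true' feature."""
    return {f"is_a_{ancestor}": "true" for ancestor in hypernym_chain(word)}

print(hypernym_chain("cat"))
print(hypernym_features("cat"))
```

Features generated this way could be merged into the hand-entered ones, so that a clue like "it is an animal" matches cat without you having typed that attribute yourself.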
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki