MARVIN was designed as a multi-agent softbot (Fig. 1). Each agent possesses filtering capabilities: it downloads Web pages and computes the medical "score" of each page, using a glossary of medical terms to calculate how frequently glossary words appear in the page.
Categorising documents: medical or not?
The score computed by MARVIN determines whether or not a Web page is medical or health-related. It is obtained by adding up the number of medical terms in the document, taking into account the different translations and the weight of each medical term as defined by the built-in glossary.
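The scoring step described above can be sketched as follows. This is only a minimal illustration, not MARVIN's actual implementation: the glossary entries, their weights, the `THRESHOLD` value and the function names are all hypothetical, and the use of a single document-level cut-off is an assumption made for the example.

```python
import re

# Hypothetical glossary: term -> weight. Translations of the same concept
# are listed as separate entries with the same weight (illustrative values).
GLOSSARY = {
    "diabetes": 3.0,
    "diabète": 3.0,      # French
    "insulin": 2.5,
    "insuline": 2.5,     # French
    "treatment": 1.0,
    "traitement": 1.0,   # French
}

# Assumed document-level cut-off; the real value is not given in the text.
THRESHOLD = 5.0


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def medical_score(text, glossary=GLOSSARY):
    """Sum the weights of all glossary terms occurring in the text.

    Every occurrence counts, so a frequent term contributes more.
    """
    return sum(glossary.get(token, 0.0) for token in tokenize(text))


def is_medical(text, threshold=THRESHOLD):
    """Classify a page as medical if its score reaches the threshold."""
    return medical_score(text) >= threshold
```

For instance, `medical_score("Insulin treatment for diabetes")` adds the three weights 2.5, 1.0 and 3.0. This sketch matches single words only; a real glossary would also need to handle multi-word terms.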
In the medical domain, many thesauri and glossaries already existed, such as MeSH (Medical Subject Headings) from the National Library of Medicine (NLM) and the glossary in nine European languages developed at the Heymans Institute of Pharmacology, University of Ghent, Belgium, within the framework of a European project. For our application, HON built its own thesaurus by compiling several of these sources. Starting with 12,000 bilingual (English/French) medical terms, the thesaurus was expanded with Danish, Dutch, German, Italian, Portuguese and Spanish, resulting in a thesaurus of 20,000 multilingual medical terms (not counting the 33,000 MeSH terms).
Studies were undertaken to estimate the relative importance of a term in a document and in a collection of documents, allowing us to weight each medical term included in our medical glossary. We analysed 1,000 documents known to be related to medical and health topics and 1,000 documents related to other domains. The medical terms included in each Web page were then evaluated. This study, combined with other techniques such as the formula of Wilbur and Yang (An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts, Comp. Bio. Med. 26.3 p. 209-222, 1996), allowed us to define a threshold for each term contained in our medical glossary.
Using our multilingual medical thesaurus of 50,000 terms, the download of Web pages and the calculation of a score according to the page content, MARVIN generates a classical inverted index, in which each word is associated with the list of documents containing that word. Matching the requested terms is then a simple and efficient task.
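A classical inverted index of the kind mentioned above can be sketched in a few lines. The function names, the sample documents and the choice to require all query terms (AND semantics) are assumptions made for illustration; the text does not specify how MARVIN combines query terms.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it.

    `documents` is a dict of document id -> page text (hypothetical format).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term.

    AND semantics are an assumption of this sketch.
    """
    terms = tokenize(query)
    if not terms:
        return set()
    # One lookup per term, then intersect the posting sets.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Illustrative data only.
docs = {
    "a": "insulin treatment",
    "b": "insulin dosage",
    "c": "weather report",
}
idx = build_inverted_index(docs)
matches = search(idx, "insulin treatment")
```

Because each query term resolves to its document set with a single dictionary lookup, matching costs roughly the size of the posting sets involved rather than the size of the whole collection.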