Subject Based Semantic Document Clustering

In the digital document world, diverse file formats create a language barrier. Most software understands only its own formats and needs translators, such as import modules or display plugins, to handle others. A search engine that works across formats is therefore valuable, and few existing tools offer this. To build such an engine, the application must understand various file formats and how each one stores text. First, the user defines the subject with keywords for the document search, and WordNet is used to find synonyms. Next, the application identifies each file's type and extracts its text content for searching. Documents that contain the subject or its synonyms are then clustered and displayed, with the matching terms highlighted. The application offers a GUI with a graphical representation of the search.

Module Description

Pre-processing of initial subject and keywords

This module accepts a set of keywords, including the subject, as input. The application determines the possible parts of speech of each keyword and then finds the synonyms of the keyword for each part of speech.
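The expansion step can be sketched as follows. This is a minimal illustration, not the project's actual code: `TOY_LEXICON` is a hypothetical hand-written table standing in for WordNet, and the function names `expand_keywords` and `search_terms` are assumptions made for the example.

```python
# Hypothetical stand-in for WordNet: maps (word, part of speech) to synonyms.
# A real system would query WordNet instead of this hand-written table.
TOY_LEXICON = {
    ("car", "noun"): {"automobile", "auto"},
    ("fast", "adjective"): {"quick", "rapid"},
    ("fast", "verb"): {"abstain"},
}


def expand_keywords(keywords, lexicon=TOY_LEXICON):
    """Return {keyword: {part_of_speech: sorted synonyms}} for each keyword."""
    expanded = {}
    for raw in keywords:
        word = raw.strip().lower()
        if not word:
            continue  # skip empty input
        by_pos = {}
        for (entry, pos), synonyms in lexicon.items():
            if entry == word:
                by_pos[pos] = sorted(synonyms)
        expanded[word] = by_pos
    return expanded


def search_terms(expanded):
    """Flatten the expansion into one set of terms to search for."""
    terms = set(expanded)  # the keywords themselves
    for by_pos in expanded.values():
        for synonyms in by_pos.values():
            terms.update(synonyms)
    return terms


if __name__ == "__main__":
    expansion = expand_keywords(["Car", "fast"])
    print(expansion)
    print(sorted(search_terms(expansion)))
```

Grouping synonyms by part of speech keeps the different senses of a word such as "fast" apart, so a later stage could choose to search only the senses that fit the subject.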

Text Extraction and document pre-processing

This module takes a folder, drive, or removable disk as input. It identifies the type of each document it finds and uses the appropriate parser to extract the text content. Once the text is extracted, it is tokenized and stored.
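A minimal sketch of this pipeline, using only the Python standard library, might look like the following. It assumes the file type can be told from the file extension and covers only plain text and HTML; formats such as PDF or word-processor files would need their own parsers registered in the `PARSERS` table. All names here (`PARSERS`, `extract_txt`, `extract_html`, `tokenize`, `index_folder`) are hypothetical.

```python
import os
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text between HTML tags (script/style content included)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_txt(path):
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def extract_html(path):
    collector = _TextCollector()
    collector.feed(extract_txt(path))
    collector.close()
    return " ".join(collector.parts)


# Assumption: the extension identifies the format. Add one entry per format.
PARSERS = {".txt": extract_txt, ".html": extract_html, ".htm": extract_html}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_folder(root):
    """Walk root and return {file path: token list} for supported files."""
    index = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            parser = PARSERS.get(os.path.splitext(name)[1].lower())
            if parser is None:
                continue  # unsupported format, skipped in this sketch
            path = os.path.join(folder, name)
            index[path] = tokenize(parser(path))
    return index
```

Keeping the parsers in a lookup table means that supporting a new format only requires one new function and one new table entry; the walking and tokenizing code stays unchanged.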

Semantic Searching and clustering Module

This module searches for the documents that contain the keywords or their synonyms. If a match is found, the document is displayed in the output window, and it can be opened in the application associated with its file format. The matching keywords and synonyms in the document are highlighted, and the documents are listed by modification time, with the most recently modified document first.
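The matching, ordering, and highlighting described above can be sketched as below. This is an illustration under stated assumptions: each document is already available as a list of lowercase tokens keyed by its path, only single-word terms are matched, and highlighting is shown with plain `[[` `]]` markers rather than GUI styling. The function names are hypothetical.

```python
import os
import re


def find_matches(index, terms):
    """index: {path: [tokens]}. Return {path: sorted terms found in it}."""
    wanted = {term.lower() for term in terms}
    hits = {}
    for path, tokens in index.items():
        matched = wanted.intersection(tokens)
        if matched:
            hits[path] = sorted(matched)
    return hits


def newest_first(paths, mtime=os.path.getmtime):
    """Order paths so the most recently modified document comes first."""
    return sorted(paths, key=mtime, reverse=True)


def highlight(text, terms, left="[[", right="]]"):
    """Wrap every whole-word, case-insensitive occurrence of a term."""
    if not terms:
        return text
    # Longest terms first so a longer term wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b"
    return re.sub(
        pattern,
        lambda match: left + match.group(0) + right,
        text,
        flags=re.IGNORECASE,
    )
```

The `mtime` parameter defaults to the file system's modification time and can be swapped for another lookup, which makes the ordering easy to check without touching real files.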

User Interface Module

This module provides the graphical user interface of the application.

Document Search Engine
File Format Compatibility
Text Content Extraction