
Some of my experience with Linux systems

Linux is a great operating system, or rather a whole family of operating systems. But in poorer countries, where people usually can't afford to buy an operating system and bandwidth is also expensive, it is more difficult to use a Linux-based operating system, because installing most software requires an internet connection. Well, here is a solution to that.

Integration of WordPress and CodeIgniter


The world of web technology is ever growing, so to ease the process and to help developers, lots of ready-made tools can be found on the internet. People nowadays do not want to write code from scratch. Today I am going to focus on two very popular web technologies: WordPress and CodeIgniter.

A Glance at the Smart News Aggregation Project

Introduction
(This post is designed to give some starting concepts of text analysis and is not solely about the Smart News Aggregation project.)

Text analysis is a very broad topic. People have long been interested in building programs that can understand human languages and interpret them automatically. A lot of work has been done on language-structure-based and pattern-based analysis of text, but nobody has yet achieved an excellent result in analysing text based on its semantics. Smart News Aggregation is also a pattern-based analysis of text (news, in this case), where news from different sources is collected and a few text-scoring algorithms are used to calculate the similarity of different texts or articles (namely BM25, Term Frequency-Inverse Document Frequency (TF-IDF), and Cosine Similarity).


This was a research project, involving some research into existing systems and algorithms for analysing text. Some of the terms that demand familiarity are the document-vector model (term-document matrix), MySQL full-text search (not implemented in this project), relevance, novelty, transition smoothness, etc.
The most accurate of these algorithms was BM25, and it was fast because of the use of the Lucene API (for indexing the articles, so that information about terms could be found faster when needed). Actually, the others might be quite fast too. The basic steps these algorithms follow are:
  • First, pass a set or collection of (random) articles to the block (program entity) that divides the articles into a word-count vs. document matrix. Traversing one row (one row represents one word) gives the count of that particular word in the different documents (one column represents one separate document).
  • Then the scoring algorithm takes two documents at a time and uses the matrix above to calculate their score, based on the corresponding formula (see the sketch just below this list).
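For reference, the standard Okapi BM25 formula (the textbook form; I am not claiming this is the exact variant the project used) scores a document D against a query Q as

$$\mathrm{score}(D,Q) = \sum_{i} \mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D)+k_1\left(1-b+b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i,D)$ is the count of term $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are tuning parameters (commonly $k_1$ between 1.2 and 2.0, and $b = 0.75$). In news grouping, one article plays the role of the "query" against another.

And here is a minimal Python sketch of the two steps above, using TF-IDF weights and cosine similarity (the simplest of the three scorers). This is illustrative only, not the project's code; the tokenizer, function names, and sample articles are all made up for the example:

```python
import math
from collections import Counter

def tokenize(text):
    """Very naive tokenizer: lowercase and split on non-letters."""
    return ''.join(c if c.isalpha() else ' ' for c in text.lower()).split()

def term_doc_matrix(docs):
    """Build the word-count vs. document matrix described above:
    each row is a term, each column a document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = set().union(*counts)
    return {term: [c[term] for c in counts] for term in vocab}

def tfidf_vector(matrix, doc_index, n_docs):
    """TF-IDF weights for one document (one column of the matrix)."""
    vec = {}
    for term, row in matrix.items():
        tf = row[doc_index]
        if tf == 0:
            continue
        df = sum(1 for count in row if count > 0)  # document frequency
        vec[term] = tf * math.log(n_docs / df)     # tf * idf
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse TF-IDF vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "Election results announced in Kathmandu today",
    "Results of the election were announced today in Kathmandu",
    "Heavy rainfall expected in the eastern region this week",
]
matrix = term_doc_matrix(docs)
vectors = [tfidf_vector(matrix, i, len(docs)) for i in range(len(docs))]
print(cosine_similarity(vectors[0], vectors[1]))  # high: same story
print(cosine_similarity(vectors[0], vectors[2]))  # near zero: unrelated
```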
What we did here was multiple levels/layers of grouping, so that the accuracy kept increasing at each level. For example, when the articles were collected, the algorithms were first used to categorize/sub-categorize them, so that one level of grouping (dividing the articles into groups based on the category/sub-category they belong to) was done with the help of a collection of keywords for each sub-category (built manually). Now we had different categories and sub-categories, each containing multiple articles belonging to that sub-category. Then related articles were analysed only within the same or similar sub-categories, which also increased the accuracy of the grouping as a whole.
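As a concrete illustration of that first keyword level, here is a rough sketch; the sub-categories and keyword lists below are invented for this post (the project's real lists were built manually, per sub-category):

```python
# First-level grouping: assign each article to the sub-category whose
# (manually built) keyword list it matches most often.
# These sub-categories and keywords are invented for illustration.
SUBCATEGORY_KEYWORDS = {
    "politics/election": {"election", "vote", "ballot", "candidate"},
    "weather":           {"rainfall", "monsoon", "flood", "temperature"},
    "sports/football":   {"football", "goal", "league", "match"},
}

def categorize(text):
    """Return the sub-category with the most keyword hits, or None."""
    tokens = text.lower().split()
    hits = {sub: sum(tok in kws for tok in tokens)
            for sub, kws in SUBCATEGORY_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

print(categorize("Voters cast their ballot in the election today"))
# -> politics/election
```

After this pass, the heavier pairwise scoring only has to compare articles inside one sub-category, which is both faster and more accurate.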
The score a document makes with itself is the highest possible score, which may be used as a reference score to set the threshold of similarity with other documents.
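In code, assuming some pairwise score function like the sketches above, that reference-score idea might look like this. The 0.8 fraction is an illustrative value, not the project's tuned threshold; note that for cosine similarity the self-score is always 1.0, so this normalisation matters most for BM25-style scores, which vary per document:

```python
def is_related(score, doc_a, doc_b, fraction=0.8):
    """Mark a pair as related if its score reaches a fraction of the
    self-score, the highest score doc_a can make (0.8 is illustrative)."""
    self_score = score(doc_a, doc_a)
    return score(doc_a, doc_b) >= fraction * self_score
```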

Idea
There is a lot of work that can be done based on text analysis. This is a very interesting topic, and lots of projects and research are being carried out on it all over the world, even at PhD level. This project focused on news categorization and grouping; others could be document clustering and grouping, semantic analysis, artificial-intelligence programs to understand human language, text summarization, and lots more.

Problem
A news site (say NepalWatch): we collect Nepali articles from different sources (RSS feeds from already existing sites). Now the problem is that the same news may come from different sources with different titles and similar content. We solve this problem manually by going through each article (maybe just the title) and marking related ones as a group, so that the same article does not appear in two different places on the same page and can be suggested as related. What we want is to automate the grouping algorithm, so that we do not have to put lots of time and effort into manually grouping these articles.

Well, there is always a compromise between effort and accuracy. A computer cannot be made one hundred percent accurate and can't do things as well as a human mind. But it can do pretty well, which was shown by the result: more than eighty percent of the articles this algorithm suggested as related were actually related, when later checked by humans.

Any questions/queries are welcome any time.