
How to read text content from a URL | Java

I was involved in a project where the task was to grab the contents of a URL and then process that content further. The later processing will be discussed in another post; this one is dedicated to grabbing the text content of the URL. Initially we implemented our own logic for this task, but later I found an amazing API, which I introduce below.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 *
 * @author Rajan Prasad Upadhyay
 */
public class URLReader {

    /* Fetches the raw HTML of the given URL, line by line. */
    public static String getUrlContent(String url) {
        StringBuilder builder = new StringBuilder();
        try {
            URL uri = new URL(url);
            // try-with-resources closes the reader even if an exception is thrown
            try (BufferedReader in = new BufferedReader(new InputStreamReader(uri.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // keep the line break; readLine() strips it
                    builder.append(line).append('\n');
                }
            }
        } catch (MalformedURLException ex) {
            Logger.getLogger(URLReader.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(URLReader.class.getName()).log(Level.SEVERE, null, ex);
        }
        return builder.toString();
    }

    /* Naive tag stripper: deletes everything between '<' and the next '>'.
       Good enough for a quick look, but it does not handle scripts, styles,
       comments, or tags split across lines. */
    public static String htmlToTextFilter1(String rawHtml) {
        return rawHtml.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String rawHtml = getUrlContent("http://rajanpupa.blogspot.com/");
        System.out.println(htmlToTextFilter1(rawHtml));
    }
}


There are better ways to grab text from the web. Rather than implementing your own logic, you can simply use a standard API specifically designed for this kind of purpose. I used the open-source jsoup library for the same task, and its performance and accuracy are absolutely amazing.

Download the jsoup library from here, and add it to your project.
Here is a snapshot of the code.
try {
    Document doc = Jsoup.connect("http://rajanpupa.blogspot.com").get();
    String title = doc.title();
    System.out.println(title);
    System.out.println(doc.text());
} catch (IOException ex) {
    Logger.getLogger(JsoupURLReader.class.getName()).log(Level.SEVERE, null, ex);
}
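Beyond grabbing the whole text, jsoup supports CSS selectors for pulling out specific elements. A small sketch (the HTML string and class name are mine, just for illustration; `Jsoup.parse` builds the same `Document` type as `Jsoup.connect(url).get()` without touching the network):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello</p>"
                + "<a href='http://rajanpupa.blogspot.com'>my blog</a></body></html>";
        // parse(String) builds a DOM from a string, no network needed
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());            // Demo
        System.out.println(doc.select("p").text()); // Hello
        // select every anchor tag that has an href attribute
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

The same `select` calls work on a `Document` fetched from a live page.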

Text Analysis | Document parser

A while back, I was involved in a project dealing with text analysis. It was a pretty interesting project, which I only realized after it was actually completed. Anyway, recently I was surfing oDesk to see if there was something I could make a few easy bucks from. I found an assignment project whose basic requirement was to parse a text file and generate a report of the unique words, their counts, and so on. I did not get hired for the job, but I decided to post my work here.

Back then, the problem was actually to group news articles. There are many sources of news on the internet. Our idea was to channel these articles into our application, run some analysis and scoring algorithms, cluster related news, and present the related news in a more suggestive way according to each user's interests.

package testapplications;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.StringTokenizer;
import java.util.Vector;

/**
 *
 * @author Rajan Prasad Upadhyay
 */
/*
 * This class holds one piece of text document. It splits the text into
 * words along with their counts, so that distinct words can easily be
 * found. Used for indexing and for calculating similarity between
 * documents in a collection class.
 */
public class Document {

    /*
     * A small data structure holding a word along with its count,
     * built up as the text is split into words.
     */
    private class StringCounter {

        public String word = "";
        public int count = 1;

        public StringCounter(String str) {
            word = str;
        }

        public void increaseCount() {
            count++;
        }

    }

    String text;
    public Vector<String> words = new Vector<>();
    public Vector<StringCounter> count = new Vector<>();

    public void sentenceSplitter(String speech) {
        /* Uses java.util.StringTokenizer (built into Java); not used elsewhere here. */
        /* Breaks the text into sentence tokens at '.' */
        StringTokenizer st = new StringTokenizer(speech, ".");
        System.out.println(st.countTokens());
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }

    public Document(List<String> lines) {
        text = "";
        for (String str : lines) {
            text += str;
        }
        parse();
        //System.out.println(text);
    }

    public Document(String text) {
        this.text = text;
        parse();
    }

    public Document() {
    } // avoid: an empty Document has no text to parse
    /*internal operations methods not needed outside*/

    private boolean isValid(char c) {
        // punctuation and whitespace characters end a word
        return !(c == '.' || c == '\'' || c == ',' || c == '?' || c == '!'
                || c == ':' || c == '\t' || c == '\n');
    }

    private void parse() {
        /*split the text into words and store in vector*/
        String word = "";
        char c;
        int i = 0;
        int n = text.length();
        while (i < n) {
            c = text.charAt(i++);
            //System.out.print(c);
            if (c != ' ' && isValid(c)) {
                word += c;
                if (i == n) {
                    this.insertWord(word);
                    word = "";
                }
                continue;
            } else {
                if (word.length() > 0) {
                    //System.out.println(word);
                    this.insertWord(word);
                    word = "";
                } else {
                    //ie word.length=0
                    continue;
                }
            }//end else
        }//while

    }
    /*
     * These methods may be needed outside 
     */

    public void insertWord(String word) {
        /*
         * if word is not present in the vector then insert it into the vector
         * and allocate count 1 , otherwise increase the count associated with 
         * the word by 1.
         */
        word = word.trim().toLowerCase();
        //if you want case sensitive, comment the above line
        if (!words.contains(word)) {
            words.add(word);
            count.add(new StringCounter(word));
        } else {
            count.elementAt(words.indexOf(word)).increaseCount();
        }
    }

    public void printWeighedWords() {
        /*
         * This is just a debugging method
         */
        int max = getDistinctWordsCount();
        for (int i = 0; i < max; i++) {
            System.out.println(words.elementAt(i) + " :" + count.elementAt(i).count);
        }
    }

    public int getDistinctWordsCount() {
        /*
         * returns the total number of distinct words
         * ie. if the word "rain" is repeated many times
         * it returns as only one word
         */
        return words.size();
    }

    public int getTotalWordsCount() {
        /*
         * returns total number of words, counts
         * same word multiple times if it is repeated 
         * multiple times
         */
        int i = 0;
        for (int j = 0; j < count.size(); j++) {
            i += count.elementAt(j).count;
        }
        return i;
    }

    public int getCountOfTerm(String term) {
        /*
         * returns the count of term if it is present
         * otherwise returns zero
         */
        int ind = words.indexOf(term);
        if (ind >= 0) { // indexOf returns -1 when absent; index 0 is a valid hit
            return count.elementAt(ind).count;
        } else {
            return 0;
        }
    }

    @Override
    public String toString() {
        String doc = "[";
        int max = getDistinctWordsCount();
        for (int i = 0; i < max; i++) {
            doc += words.elementAt(i) + "(" + count.elementAt(i).count + ")";
            if (i < max - 1) {
                doc += ", ";
            }
        }
        doc += "]";
        return doc;

    }

    public static void main(String[] args) throws IOException {
       //Document doc = parseFile("assignment1.txt");

        Document doc = new Document("Hello World, My name is Rajan. and Hello world, "
                + "I enjoy doing this kind of works.");

        System.out.println("Total words count: " + doc.getTotalWordsCount());
        System.out.println("Distinct words count: " + doc.getDistinctWordsCount());
        System.out.println("Weighed words: \n");
        System.out.println(doc);

    }

    public static Document parseFile(String filepath) {
        // String filepath = "assignment1.txt";
        Charset ENCODING = StandardCharsets.UTF_8;

        Path path = Paths.get(filepath);

        //System.out.println(Files.readAllLines(path, ENCODING));
        Document doc = null;
        try {
            doc = new Document(Files.readAllLines(path, ENCODING));
            int count = doc.getDistinctWordsCount();
            System.out.println("Words count:  " + count);
            System.out.println("Words:  " + doc.words);
            doc.printWeighedWords();
        } catch (IOException e) {
            System.out.println("error");
            e.printStackTrace();
        }
        return doc;
    }
}
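The parallel `Vector`s above work, but the same bookkeeping is simpler and faster with a `HashMap`, since lookups no longer scan the whole word list. A minimal sketch of the alternative (class and method names are my own, not from the original project):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Split on anything that is not a letter, digit, or apostrophe,
    // lowercase each word, and count it
    public void addText(String text) {
        for (String word : text.toLowerCase().split("[^a-z0-9']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // insert 1 or increment
            }
        }
    }

    public int distinctWords() {
        return counts.size();
    }

    public int totalWords() {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    public int countOf(String term) {
        return counts.getOrDefault(term.toLowerCase(), 0);
    }
}
```

`Map.merge` replaces the contains/indexOf/increase dance in `insertWord` with a single call.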

A Glance at the Smart News Aggregation Project

Introduction
(This post is intended to give starting concepts of text analysis; it is not solely related to the SmartNewsAggregation project.)

Text analysis is a very broad topic. People have long been interested in building programs that could understand human languages and interpret them automatically. A lot of work has been done on language-structure-based and pattern-based analysis of text, but nobody has yet achieved excellent results analyzing text by its semantics. Smart News Aggregation is also a pattern-based analysis of text (news, in this case), where news from different sources is collected and a few text-analysis scoring algorithms are used to calculate the similarity of different texts or articles. (Those algorithms are BM25, TermFrequency-InverseDocumentFrequency (TF-IDF), and CosineSimilarity.)


This was a research project, involving some study of existing systems and algorithms for analyzing text. Some terms that demand familiarity are Document-Vector-Model (matrix), MySQL Full-Text Search (not implemented in this project), Relevance, Novelty, Transition Smoothness, etc.
The most accurate of these algorithms was BM25, and it was fast because of the Lucene API (used to index the articles so that term information could be found quickly when needed). The others might be quite fast too. The basic steps these algorithms follow are:
  • First, pass a set or collection of (random) articles to the block (program entity) that turns the articles into a word-count vs. document matrix. Traversing one row (one row represents one word) gives the count of that particular word in different documents (one column represents one separate document).
  • Then the scoring algorithm takes two documents at a time and uses the matrix above to calculate their score, based on its formula.
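The first step above can be sketched as a term-document count matrix. This is my own minimal version (the project itself used Lucene's index for this bookkeeping):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class TermDocumentMatrix {

    // term -> per-document counts: one row per word, one column per document
    public static Map<String, int[]> build(String[] docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            for (String w : docs[d].toLowerCase().split("[^a-z0-9]+")) {
                if (w.isEmpty()) {
                    continue;
                }
                // allocate one row per new word, then bump this document's column
                matrix.computeIfAbsent(w, k -> new int[docs.length])[d]++;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        String[] docs = {
            "rain in the hills",
            "heavy rain floods the valley",
            "sunny day in the valley"
        };
        // traversing one row gives the count of that word in each document
        for (Map.Entry<String, int[]> row : build(docs).entrySet()) {
            System.out.println(row.getKey() + " " + Arrays.toString(row.getValue()));
        }
    }
}
```

For the three sample documents, the row for "rain" comes out as [1, 1, 0] and the row for "valley" as [0, 1, 1].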
What we did here was multiple levels/layers of grouping, so that accuracy went on increasing at each level. For example, when the articles were collected, the algorithms were first used to categorize/sub-categorize them, so that one level of grouping (dividing articles by the category/sub-category they belong to) was done with the help of a manually built collection of keywords for each sub-category. Now we had different categories and subcategories, each containing multiple articles. Related articles were then analyzed only within the same or similar sub-categories, which also increased the accuracy of the grouping as a whole.
The score a document makes with itself is the highest score, which can be used as a reference to calculate the threshold of similarity with other documents.
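The scoring step can be sketched with cosine similarity over two term-count maps. This is plain TF-only cosine, a simplification of the weighted scoring the project used; class and method names are mine:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {

    // turn a document string into a term -> count map
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // cos(a, b) = (a . b) / (|a| * |b|), over the union of both vocabularies
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> union = new HashSet<>(a.keySet());
        union.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : union) {
            int x = a.getOrDefault(term, 0);
            int y = b.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termCounts("the cat sat on the mat");
        Map<String, Integer> d2 = termCounts("the cat ate the rat");
        System.out.println(cosine(d1, d1)); // a document with itself scores 1.0
        System.out.println(cosine(d1, d2)); // related documents score below 1.0
    }
}
```

The self-similarity score of 1.0 is exactly the reference score mentioned above.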

Idea
There is a lot of work that can be done based on text analysis. It is a very interesting topic, and many projects and research efforts, even at the PhD level, are being carried out on it all over the world. This project focused on news categorization and grouping; others may be document clustering, semantic analysis, artificial-intelligence programs to understand human language, text summarization, and lots more.

Problem
A news site (say NepalWatch): we collect Nepali articles from different sources (RSS feeds from already-existing sites). The problem is that the same news may come from different sources with different titles and similar content. We used to solve this manually by going through each article (maybe just the title) and marking related ones as a group, so that the same article does not appear in two different places on the same page and is instead suggested as related. What we want is to automate the grouping so that we do not have to spend so much time and effort manually grouping these articles.

Well, there is always a compromise between effort and accuracy. A computer cannot be made a hundred percent accurate and can't do things as well as a human mind. But it can do pretty well, as the results proved: more than eighty percent of the articles this algorithm suggested as related were actually related, when later checked by humans.

Any questions or queries are welcome any time.