I was involved in a project where the task was to grab the content from a URL and then do some further processing on it. The further processing will be discussed later; this post is dedicated to grabbing the text content of a URL. Initially we implemented our own logic for this task, but later I found an amazing API, which I am going to introduce here.
Here is our initial, hand-written implementation. It reads the raw HTML from the URL line by line and then strips the tags with a regular expression.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
 *
 * @author Rajan Prasad Upadhyay
 */
public class URLReader {
    public static String getUrlContent(String url){
        StringBuilder builder = new StringBuilder();
        // try-with-resources closes the reader even if reading fails midway
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while((line = in.readLine()) != null){
                // a space keeps words on adjacent lines from running together
                builder.append(line).append(' ');
            }
        } catch (MalformedURLException ex) {
            Logger.getLogger(URLReader.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(URLReader.class.getName()).log(Level.SEVERE, null, ex);
        }
        return builder.toString();
    }
    
    public static String htmlToTextFilter1(String rawHtml){
        
        String text = rawHtml.replaceAll("\\<.*?>","");
        
        return text;
    }
    
    public static void main(String[] args) {
        String rawHtml = getUrlContent("http://rajanpupa.blogspot.com/");
        
        System.out.println(htmlToTextFilter1(rawHtml));
    }
}
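To see where a regex-based filter like htmlToTextFilter1 falls short, here is a small offline sketch that runs it on a hard-coded HTML string. The class HtmlTextSketch and the methods naiveStrip and betterStrip are hypothetical names made up for this illustration, not part of the code above. The "better" variant is still only a rough approximation that handles script/style blocks and four common entities; it is not a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the URLReader class above.
final class HtmlTextSketch {

    // Matches <script>...</script> and <style>...</style> blocks, bodies included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("<(script|style)\\b[^>]*>.*?</\\1\\s*>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches a single tag such as <p> or </div>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Same idea as htmlToTextFilter1: delete anything that looks like a tag.
    public static String naiveStrip(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Slightly more careful: drop script/style bodies first, then the tags,
    // then decode a few common entities and collapse runs of whitespace.
    public static String betterStrip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&"); // decoded last to avoid double decoding
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><p>Tom &amp; Jerry</p></body></html>";
        // prints: var x = 1;Tom &amp; Jerry  (script source and raw entity leak through)
        System.out.println(naiveStrip(html));
        // prints: Tom & Jerry
        System.out.println(betterStrip(html));
    }
}
```

Every fix of this kind adds another special case (comments, malformed tags, the long list of named entities), which is a good reason to hand the job to a dedicated HTML parser.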
There are better ways to grab text from the web. Rather than implementing your own code and logic, you can simply use an established library designed specifically for this kind of purpose. I used the open-source library JSoup for the same task, and its performance and accuracy are really impressive.
Download the JSoup library and add it to your project's classpath.
Here is a snapshot of the code. It fetches the page, parses it, and prints the title followed by the text content.
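import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JSoupURLReader {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://rajanpupa.blogspot.com").get();
            System.out.println(doc.title());  // the page's <title>
            System.out.println(doc.text());   // text content with the markup removed
        } catch (IOException ex) {
            Logger.getLogger(JSoupURLReader.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}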
 