Harvesting Email Addresses from a Website using Java

In this tutorial we’ll create a small Java command line application that extracts email addresses from websites. A program like this comes in handy for people who work in advertising and similar fields.

So before we jump right into programming, let’s think about the possible steps of the program.

Since we are extracting emails from a website, we’ll ask the user to input the URL of the website. Having the URL doesn’t magically give us the emails, so we first have to get the contents of the page at that URL. Now that we have the contents, how are we going to extract the emails? Yes, you guessed it right, we’ll use a regular expression (regex).

So the steps are:

  1. Get website URL from the user
  2. Get the contents of the URL
  3. Run REGEX on the contents
  4. Print out email addresses extracted by the REGEX from the contents.

Now that we have a basic layout of our program, let’s start coding part by part, adding possible improvements along the way. First we’ll create our EmailExtractor class.

/**
* @author ex094
*/
public class EmailExtractor {

}

 

Handling URL

We’ll initialize the EmailExtractor with a URL that the user supplies via command line arguments, but we’ll cover that part at the end. For now we’ll create the constructor for EmailExtractor, which takes a URL string as an argument and initializes the URL object.

So the code is:

import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {

    URL url; //URL Instance Variable

    EmailExtractor(String url) {
        this.url = new URL(url); //Initializing our URL object
    }

}

If you are new to the Java URL class, it simply allows us to open a connection to the specified URL and read data from it. You must specify the protocol (http/https) in the URL, otherwise the URL constructor throws a MalformedURLException. Because that is a checked exception, the code above won’t compile as it stands, so we enclose the statement in a try/catch block.

import java.net.MalformedURLException;
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {

    URL url; //URL Instance Variable

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }
}
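
To see the protocol requirement in isolation, here is a minimal sketch. The class UrlProtocolCheck and its helper hasValidProtocol are names made up for this example and are not part of the extractor.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlProtocolCheck {

    // Hypothetical helper: true if the string can be parsed into a URL object
    static boolean hasValidProtocol(String address) {
        try {
            new URL(address); // throws if the protocol is missing or unknown
            return true;
        } catch (MalformedURLException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasValidProtocol("http://www.google.com")); // prints true
        System.out.println(hasValidProtocol("www.google.com"));        // prints false
    }
}
```

The second call fails only because the protocol is missing; the host name itself is the same in both calls.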

Getting URL Contents

In the previous section we initialized our URL object to hold the user’s URL. What we need to do now is read the contents of the URL and store them in a variable, so that we can later apply the regex and extract email addresses from it.

Let’s create a method readContents which will read the contents from the URL. It uses a BufferedReader to read the InputStream from the URL object and saves the contents in a StringBuilder variable.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {

    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }

    public void readContents() {

       //Open Connection to URL and get stream to read
        BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
        contents = new StringBuilder();
        //Read and Save Contents to StringBuilder variable
        String input = "";
        while((input = read.readLine()) != null) {
            contents.append(input).append('\n'); //readLine() strips the line break, so add it back
        }

    }
}

The call to url.openStream() opens a connection to the URL and returns an InputStream from which we can read the data. The InputStreamReader turns those bytes into characters, and the BufferedReader buffers them so that we can read the contents line by line.

The readContents method is complete but there’s a problem: if the URL supplied by the user is in the correct format but doesn’t actually exist on the internet, url.openStream() will throw an IOException. We need to handle that exception too, so we just surround the whole block with try/catch.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {

    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                contents.append(input).append('\n'); //readLine() strips the line break, so add it back
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
        }
    }
}

Now if the user enters a URL like http://123asd.com which doesn’t exist, our program will catch the exception and print Unable to read URL due to Unknown Host..
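
One detail of the reading loop is worth a small sketch of its own. readLine() returns each line without its line break, so it helps to keep a separator between lines when joining them. Otherwise the end of one line gets glued to the start of the next, and an address at the end of a line could pick up extra characters. The class ReadAllLines and the helper readAll below are hypothetical names, and a StringReader stands in for the network stream so the example runs offline. It also uses try-with-resources, which closes the reader for us.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadAllLines {

    // Hypothetical helper: reads everything from a Reader, ending each line with '\n'
    static String readAll(Reader source) throws IOException {
        StringBuilder contents = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader read = new BufferedReader(source)) {
            String line;
            while ((line = read.readLine()) != null) {
                contents.append(line).append('\n');
            }
        }
        return contents.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded page; the address sits at the end of a line
        String page = "Contact: admin@example.com\nNext line";
        System.out.print(readAll(new StringReader(page)));
    }
}
```

Without the appended '\n' the two lines would be joined as admin@example.comNext line, and the email pattern used later would match admin@example.comNext.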

Extracting Email Addresses Using REGEX

When we obtain the contents of the URL, they’ll be in messy HTML form. Using a regular expression pattern for email addresses, we can find the matching strings in the contents.

The regular expression used is: \b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b

We will create an extractEmail method which uses the regex to search for email addresses in the contents. Each time it gets a hit, it stores that email address in a collection. We could use an ArrayList of Strings, but the same email may appear on a page more than once, so to keep the addresses unique we’ll use a Set instead.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {

    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents = new StringBuilder(); //Stores our URL Contents (starts empty so extractEmail works even if reading fails)
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                contents.append(input).append('\n'); //readLine() strips the line break, so add it back
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, append to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }
}
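
To try the pattern without downloading anything, here is a small sketch that runs the same regular expression over a hard-coded HTML snippet. The class RegexDemo, the helper findEmails and the sample addresses are made up for this example. It uses a TreeSet instead of a HashSet only so that the printed order is predictable.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {

    // Same pattern as in the tutorial, compiled once
    static final Pattern EMAIL =
            Pattern.compile("\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b");

    // Hypothetical helper: collects every distinct match in sorted order
    static Set<String> findEmails(CharSequence text) {
        Set<String> found = new TreeSet<>();
        Matcher match = EMAIL.matcher(text);
        while (match.find()) {
            found.add(match.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // admin@example.com appears twice but is stored only once
        String html = "<a href=\"mailto:admin@example.com\">admin@example.com</a>"
                + " <p>sales@example.org</p>";
        System.out.println(findEmails(html)); // prints [admin@example.com, sales@example.org]
    }
}
```

Keep in mind that this pattern is deliberately simple. For instance, it does not allow an underscore before the @ sign, so an address like john_doe@example.com is not matched at all.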

Printing out Email Addresses

To print out email addresses to the command line from the emailAddresses set, we will create a method printAddresses:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {

    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents = new StringBuilder(); //Stores our URL Contents (starts empty so extractEmail works even if reading fails)
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                contents.append(input).append('\n'); //readLine() strips the line break, so add it back
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, append to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }

    public void printAddresses() {
        //Check if email addresses have been extracted
        if(emailAddresses.size() > 0) {
            //Print out all the extracted emails
            System.out.println("Extracted Email Addresses: ");
            for(String emails : emailAddresses) {
                System.out.println(emails);
            }
        } else {
            //In case, no email addresses were extracted
            System.out.println("No emails were extracted!");
        }
    }
}

The printAddresses method first checks that the Set is not empty, i.e. that there are emails in it. If there are, it prints all of the email addresses. If no email addresses were found in the website contents, i.e. the size of the Set is zero, it prints No emails were extracted!

Saving Email Addresses to a Text File (Extra)

Suppose a site you just scraped contains 1000 email addresses and all of them get printed on your terminal. Scrolling through them and copying and pasting is time consuming and annoying. So instead we can create a method called saveAddresses which saves all the extracted email addresses to a file with the name that the user gives it.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {

    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents = new StringBuilder(); //Stores our URL Contents (starts empty so extractEmail works even if reading fails)
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                contents.append(input).append('\n'); //readLine() strips the line break, so add it back
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, append to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }

    public void printAddresses() {
        //Check if email addresses have been extracted
        if(emailAddresses.size() > 0) {
            //Print out all the extracted emails
            System.out.println("Extracted Email Addresses: ");
            for(String emails : emailAddresses) {
                System.out.println(emails);
            }
        } else {
            //In case, no email addresses were extracted
            System.out.println("No emails were extracted!");
        }
    }

    public void saveAddresses(String filename) {
        //Create a new .txt file
        File file = new File(filename + ".txt");
        //Setting charset so the file is always written as UTF-8
        Charset charset = Charset.forName("UTF-8");

        //Create a BufferedWriter that writes emails to the file using that charset
        try(BufferedWriter write = java.nio.file.Files.newBufferedWriter(file.toPath(), charset)) {
            //Write each email address on a newline in the file
            for(String emails : emailAddresses) {
                write.write(emails);
                write.newLine();
            }
        } catch (IOException ex) {
            System.out.println("Could not save email addresses to text file!");
        }
    }
}

The File object represents a text file with the name that is passed as an argument to the method (with .txt appended); the file itself is created when the writer opens it. The BufferedWriter writes the email addresses to the text file, each on a new line. In case there’s a problem writing the file, the try/catch block will handle the IOException.
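
If you want to test the saving step on its own, the sketch below writes a couple of made-up addresses to a temporary file and reads them back. The class SaveDemo and the helper save are hypothetical names. It uses Files.newBufferedWriter with an explicit UTF-8 charset, which is one possible way to open the writer and not the only one.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class SaveDemo {

    // Hypothetical helper: writes one address per line, encoded as UTF-8
    static void save(Iterable<String> addresses, Path target) throws IOException {
        try (BufferedWriter write = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String email : addresses) {
                write.write(email);
                write.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file, so the example leaves nothing behind
        Path target = Files.createTempFile("emails", ".txt");
        save(Arrays.asList("admin@example.com", "sales@example.org"), target);
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
        Files.delete(target);
    }
}
```

Running it prints the two addresses back as a list, which confirms that each one ended up on its own line.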

Command Line Arguments

So we are basically done creating our EmailExtractor class and its essential methods. What we need to do now is handle the user’s input. Time for us to create our main method:

public static void main (String[] args) {

}

You’ve written this piece of code thousands of times, yet have you ever wondered what String[] args means?

  • String[] args is simply an array of Strings, that contains command line arguments passed by the user.

If the program is run without any arguments, the args array is empty,

  • args = []

So when you type in your terminal something like:

  • java EmailExtractor hello world

The args array becomes,

  • args = ["hello", "world"]

Since it’s a typical Java array, we can access the passed arguments by index. So if I wanted to see the first argument the user passed, I would simply do

  • System.out.println(args[0]);

And it will print hello.

Another thing to keep in mind is that args is just the name of the array, you can name it anything like String[] myArguments but it’s recommended that you follow the convention and keep it as String[] args.
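
Here is a tiny sketch you can compile to watch the args array in action. The class ArgsDemo and the helper describe are names invented for this example.

```java
class ArgsDemo {

    // Hypothetical helper: lists how many arguments there are and each one with its index
    static String describe(String[] args) {
        StringBuilder out = new StringBuilder("count=" + args.length);
        for (int i = 0; i < args.length; i++) {
            out.append(" [").append(i).append("]=").append(args[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Running java ArgsDemo hello world prints count=2 [0]=hello [1]=world, and running it with no arguments prints count=0.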

Handling Command Line Arguments

For our application here, we will have 2 arguments

  1. URL of the website
  2. Save Email Addresses

The first argument is required and the second is optional, depending on whether you want to save the email addresses to a file. When you run the app with just the URL as the argument, it’ll extract the email addresses and print them by default. But if you want to save those email addresses to a text file, you need to add an extra argument followed by another argument that is the name of the file, for example -s emails.

Here -s is the argument that indicates that the user wants to save the file, and emails is the name of that file. So our main method becomes

public static void main (String args[]) {
        EmailExtractor extract;

        //Check if arguments are supplied and URL is supplied
        if(args.length > 0 && args[0] != null) {

            //If length of args is 3 and -s in args, then save the emails
            if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {

            //Just print them normally
            } else {

            }
        } else {
            System.out.println("Invalid Arguments supplied...");
        }
    }

In the if condition we check that the user has supplied arguments, in particular the URL. If the URL is not supplied, the program simply tells the user Invalid Arguments supplied... If the URL is included as an argument and the user has also passed -s along with a file name, we save the email addresses to a file using the saveAddresses method; otherwise the list of email addresses is simply displayed. Now our code becomes,

public static void main (String args[]) {

        EmailExtractor extract;

        //Check if arguments are supplied and URL is supplied
        if(args.length > 0 && args[0] != null) {

            extract = new EmailExtractor(args[0]);//Initialize Extractor with URL
            extract.readContents(); //Read the URL contents
            extract.extractEmail(); //Extract the email addresses

            //If -s in args, then save the emails
            if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {
                extract.saveAddresses(args[2]); //Save the email address in a file with name from args[2]
            //Just print them normally
            } else {
                extract.printAddresses(); //Otherwise normally display the email addresses
            }
        } else {
            System.out.println("Invalid Arguments supplied...");
        }
    }
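
The decision logic in main can also be checked on its own, without touching the network. The sketch below pulls the same conditions into a small helper. The class ArgsCheck, the method mode and the strings it returns are made up for illustration.

```java
class ArgsCheck {

    // Hypothetical helper: decides what the program should do for the given arguments
    static String mode(String[] args) {
        if (args.length == 0) {
            return "invalid"; // no URL supplied
        }
        if (args.length == 3 && args[1].equals("-s")) {
            return "save";    // URL, -s and a file name
        }
        return "print";       // anything else: just print the addresses
    }

    public static void main(String[] args) {
        System.out.println(mode(args));
    }
}
```

For example, java ArgsCheck http://www.google.com -s emails prints save, while passing only the URL prints print. Note that, just like the main method above, -s without a file name falls back to printing.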

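And our Email Extractor is complete. Build the jar file using NetBeans and run it from the terminal, passing the website URL as the first argument and, optionally, -s followed by a file name.

It’ll produce output like the following:

[Screenshot from 2016-08-25 23-26-53: program output in the terminal]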

The complete code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {

    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents = new StringBuilder(); //Stores our URL Contents (starts empty so extractEmail works even if reading fails)
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {

        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
           System.out.println("\tPlease include Protocol in your URL e.g. http://www.google.com");
           System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                contents.append(input).append('\n'); //readLine() strips the line break, so add it back
            }
        } catch (IOException ex) {
            System.out.println("\tUnable to read URL due to Unknown Host..");
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, append to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }

    public void printAddresses() {
        //Check if email addresses have been extracted
        if(emailAddresses.size() > 0) {
            //Print out all the extracted emails
            System.out.println("\tExtracted Email Addresses: ");
            for(String emails : emailAddresses) {
                System.out.println(emails);
            }
        } else {
            //In case, no email addresses were extracted
            System.out.println("\tNo emails were extracted!");
        }
    }

    public void saveAddresses(String filename) {
        //Create a new .txt file
        File file = new File(filename + ".txt");
        //Setting charset so the file is always written as UTF-8
        Charset charset = Charset.forName("UTF-8");

        //Create a BufferedWriter that writes emails to the file using that charset
        try(BufferedWriter write = java.nio.file.Files.newBufferedWriter(file.toPath(), charset)) {
            //Write each email address on a newline in the file
            for(String emails : emailAddresses) {
                write.write(emails);
                write.newLine();
            }
            System.out.println("\tEmails have been saved to " + filename + ".txt");
        } catch (IOException ex) {
            System.out.println("\tCould not save email addresses to text file!");
        }
    }

    public static void main (String args[]) {

        EmailExtractor extract;

        //Check if arguments are supplied and URL is supplied
        if(args.length > 0 && args[0] != null) {

            extract = new EmailExtractor(args[0]);//Initialize Extractor with URL
            extract.readContents(); //Read the URL contents
            extract.extractEmail(); //Extract the email addresses

            //If -s in args, then save the emails
            if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {
                extract.saveAddresses(args[2]); //Save the email address in a file with name from args[2]
            //Just print them normally
            } else {
                extract.printAddresses(); //Otherwise normally display the email addresses
            }
        } else {
            System.out.println("\tInvalid Arguments supplied...");
        }
    }
}

This tutorial was fun. I’ll write a separate tutorial about command line arguments in Java, so that if you have any confusion regarding that topic you can clear it up. Have fun coding 🙂

Regards,
Ex094

 
