Question

Link Extractor


For this project you are to build a command-line tool that can extract the links from any URL on the web.
The tool should be implemented in standard C++, although it is perfectly reasonable to use APIs
specific to the operating system you are building on, since standard C++ does not provide native support for
networking.
The tool should behave as follows when run from the command line (the text does not have to match
exactly; the example just gives a general idea of the functionality):
> .\tool.exe
> Please enter a url address: http://example.com
> The following links were found:
> http://example.com/page1
> http://example.com/page2
> http://google.com/search
> http://mdn.com/asdf
> Done!
To achieve this, you will need to design the program so that it can read the URL entered at the
console, download the contents of that URL over the network, and parse the resulting HTML for anchor <a> tags
containing href attributes.
The program should be robust and must not crash, even when given an invalid URL or a URL that points
to something other than HTML. If the program cannot parse HTML from the contents returned from the
internet, it is fine to silently abort or print an error message, but crashing is not acceptable.
The program should be designed well. You should take advantage of classes, containers, and modern C++
algorithms.

Answer #1

There is a lot of boilerplate code in the function getURLToFile because C++ does not have native support for networking. Note that this code only works on Windows, since it loads urlmon.dll at run time. Please find the commented code below:

................................................................CODE STARTS HERE.......................................................................................

#include <windows.h>
#include <fstream>
#include <iostream>
#include <iterator> // istreambuf_iterator
#include <set>
#include <string>
#include <regex>
using namespace std;

// Signature of URLDownloadToFileA (the ANSI variant), so narrow strings work whether or not UNICODE is defined
typedef HRESULT (WINAPI *UDTF)(LPVOID, LPCSTR, LPCSTR, DWORD, LPVOID);

bool getURLToFile(const string& url, const string& file)
{
    bool ok = false;
    HMODULE hDll = LoadLibraryA("urlmon"); // Loads the module (DLL) urlmon into this process

    if(hDll)
    {
        // Retrieves the function URLDownloadToFileA from the urlmon DLL.
        // The pointer is named downloadFn to avoid clashing with the URLDownloadToFile macro from urlmon.h.
        UDTF downloadFn = (UDTF)GetProcAddress(hDll, "URLDownloadToFileA");
        if(downloadFn)
        {
            // Actual download happens here; S_OK (0) means success
            ok = (downloadFn(0, url.c_str(), file.c_str(), 0, 0) == S_OK);
        }
        FreeLibrary(hDll); // Unload the module
    }
    return ok; // true only if the download succeeded
}

string getStringFromFile(const string& file_name)
{
    ifstream file(file_name); // Creates the file stream; if the file cannot be opened, the result is an empty string
    return { istreambuf_iterator<char>(file), istreambuf_iterator<char>{} };
}

set<string> extractLinks(const string& file_name)
{
    // Creates the regex that parses <a> tags.
    // Limitation: it only matches anchors where href comes directly after "<a " and the value is in double quotes.
    static const regex href_regex( "<a href=\"(.*?)\"", regex_constants::icase);

    const string text = getStringFromFile(file_name); // Gets stored string

    // Returns the set of values captured by group 1 (duplicates are dropped by the set)
    return { sregex_token_iterator(text.begin(), text.end(), href_regex, 1), sregex_token_iterator{} };
}

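int main()
{
    string url;
    const string file = "sample.txt"; // File for temporary storage of web page
    cout<<"Please enter a url address: ";
    if(!(cin>>url)) // No input (e.g. end of stream): nothing to do
    {
        return 1;
    }

    if(getURLToFile(url, file)) // Try to get the url
    {
        try
        {
            const set<string> links = extractLinks(file); // Regex matching may throw regex_error
            cout<<"The following links were found:"<<endl;
            for(const string& ref: links) // Print all the links in the set
            {
                cout<<ref<<endl;
            }
            cout<<"Done!"<<endl;
        }
        catch(const regex_error&)
        {
            cout<<"Could not parse the downloaded content"<<endl; // Report the failure instead of crashing
        }
    }
    else
    {
        cout<<"Could not fetch the url"<<endl;
    }
    return 0;
}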


........................................................................CODE ENDS HERE................................................................................
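
The program hands whatever the user types straight to the download call. One possible extra guard, sketched here as an assumption and not a requirement of URLDownloadToFile, is to reject input that does not look like a web address before any network call is made. The helper name looksLikeHttpUrl is hypothetical and does not appear in the answer's code.

```cpp
#include <cctype>
#include <string>

// Hypothetical pre-check: true if the text starts with "http://" or "https://"
// (case-insensitive) and has at least one character after the scheme.
bool looksLikeHttpUrl(const std::string& url)
{
    auto startsWith = [&url](const std::string& prefix) {
        if (url.size() <= prefix.size()) return false; // nothing after the scheme
        for (std::size_t i = 0; i < prefix.size(); ++i)
        {
            // Cast to unsigned char so std::tolower is well defined for any byte
            if (std::tolower(static_cast<unsigned char>(url[i])) != prefix[i]) return false;
        }
        return true;
    };
    return startsWith("http://") || startsWith("https://");
}
```

In main, such a check could run right after reading the URL and print an error message when it returns false. It is deliberately loose: it does not prove that the URL is valid, only that it has a plausible scheme.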

Code Explanation:

  • Read the URL from the user
  • Download the page contents to a temporary file (sample.txt)
  • Read that file into a string
  • Search the HTML with a regex for <a href="..."> occurrences
  • Print the resulting set of links
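
The parsing steps do not depend on the network, so they can be tried on a string that is already in memory. The following is a minimal sketch under that assumption. extractLinksFromText is a hypothetical helper (it is not part of the code above) that applies the same regex to a string and catches std::regex_error, so that a matching failure yields a partial or empty result instead of a crash.

```cpp
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: same regex as extractLinks, but it works on a string
// that is already in memory instead of reading a file.
std::set<std::string> extractLinksFromText(const std::string& text)
{
    static const std::regex href_regex("<a href=\"(.*?)\"", std::regex_constants::icase);
    std::set<std::string> links;
    try
    {
        for (std::sregex_iterator it(text.begin(), text.end(), href_regex), end; it != end; ++it)
        {
            links.insert((*it)[1].str()); // capture group 1 is the href value
        }
    }
    catch (const std::regex_error&)
    {
        // Matching failed (for example, an implementation complexity limit was hit):
        // return whatever was collected so far rather than crashing.
    }
    return links;
}
```

Calling it on text that contains no anchors simply returns an empty set, which fits the requirement that non-HTML content must not crash the tool.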

Regex explanation:

REGEX: <a href=\"(.*?)\"

Here <a href=\" means that a match must start with <a href=". The backslash is only there because " has to be escaped inside a C++ string literal; it is not part of the regex itself. The \" at the end means that the match must end with ". The part in parentheses () is the capture group, which is the text the program collects. The dot (.) matches any character except a line break, and * means that the preceding item can occur 0 or more times consecutively. On its own, .* is greedy: after <a href=" it would consume as many characters as possible, running up to the last " on the line. The trailing ? makes it lazy, so the match stops at the first " instead. The regex_constants::icase flag makes the match case-insensitive, so <A HREF="..."> is found as well.
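
The difference between the lazy and the greedy form can be checked with a small experiment. This is only a sketch: firstCapture is a hypothetical helper, and tolerant_pattern is an assumed variant (not the pattern used in the answer) that also accepts other attributes before href and whitespace around the = sign.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of the first match, or "" if there is none.
std::string firstCapture(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex_constants::icase);
    std::smatch m;
    return std::regex_search(text, m, re) ? m[1].str() : std::string();
}

// Lazy: (.*?) stops at the first closing quote (the pattern used in the answer).
const std::string lazy_pattern = "<a href=\"(.*?)\"";
// Greedy: (.*) runs on to the last quote on the line.
const std::string greedy_pattern = "<a href=\"(.*)\"";
// Assumed, more tolerant variant: allows other attributes before href and
// whitespace around '='; it still requires a double-quoted value.
const std::string tolerant_pattern = "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"";
```

For the input <a href="http://a.com" class="x">, the lazy pattern captures http://a.com, while the greedy one captures everything up to the quote after x. For <a class="x" href = "http://b.com">, only the tolerant pattern finds a link. None of these is a full HTML parser; single-quoted or unquoted href values, for instance, are still missed.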

Sample output:

Please enter a url address: https://web.whatsapp.com
The following links were found:
http://www.firefox.com
http://www.google.com/chrome/
http://www.opera.com
https://support.apple.com/downloads/#safari
https://www.microsoft.com/en-us/windows/microsoft-edge
Done!
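
The links in the sample output appear in alphabetical order and without repeats because extractLinks returns a std::set<std::string>, which keeps its elements sorted and unique. A minimal sketch of that effect, using a hypothetical helper uniqueSorted that is not part of the answer's code:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: mirrors what collecting matches into a std::set<std::string> does.
std::vector<std::string> uniqueSorted(const std::vector<std::string>& found)
{
    const std::set<std::string> links(found.begin(), found.end()); // sorts and removes duplicates
    return std::vector<std::string>(links.begin(), links.end());
}
```

If the links should instead be printed in the order they appear on the page, one option would be to keep them in a std::vector and use a set only to detect duplicates.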

If you have any doubts, feel free to comment and I'll be pleased to help!
