Link Extractor For this project you are to build a command-line tool that can extract the...

Question

Link Extractor

For this project you are to build a command-line tool that can extract the links from any URL on the web.
The tool should be implemented in standard C++, although it is perfectly reasonable to consume APIs
specific to the operating system you are building on, since standard C++ doesn't provide native support for
networking.
The tool should have the following behavior when run from the command line (text doesn't have to match
exactly, the example just gives a general idea of the functionality):
> .\tool.exe
> Please enter a url address: http://example.com
> The following links were found:
> http://example.com/page1
> http://example.com/page2
> http://google.com/search
> http://mdn.com/asdf
> Done!
To achieve this, you will need to design the program so that it can read the URL entered at the
console, download the contents of that URL over the network, and parse the HTML result for anchor <a> tags
containing href attributes.
The program should be robust and must not crash, even when given an invalid URL or a URL that points
to something other than HTML. If the program is not able to parse HTML from the contents returned from the
internet, it is fine to silently abort or print an error message, but crashing is not acceptable.
The program should be designed well. You should take advantage of classes, containers, and modern C++
algorithms.


Answer 1

The function getURLToFile contains a lot of boilerplate because C++ has no native support for networking. The code only works on Windows, since it loads the download function from the urlmon DLL. Please find the commented code below:

................................................................CODE STARTS HERE.......................................................................................

#include <windows.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <regex>
#include <set>
#include <string>
using namespace std;

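// Signature of URLDownloadToFileA. The ANSI variant takes narrow (char) strings, so use LPCSTR rather than LPCTSTR
typedef HRESULT (WINAPI *UDTF)(LPVOID, LPCSTR, LPCSTR, DWORD, LPVOID);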

// Downloads the contents of url into file; returns true on success
bool getURLToFile(const string& url, const string& file)
{
    bool ok = false;

    HMODULE hDll = LoadLibraryA("urlmon"); // Loads the module (DLL) urlmon into this process
    if(hDll)
    {
        // Retrieves the function URLDownloadToFileA from the urlmon DLL
        UDTF download = (UDTF)GetProcAddress(hDll, "URLDownloadToFileA");
        if(download)
        {
            // Actual download happens here; S_OK means success
            ok = (download(0, url.c_str(), file.c_str(), 0, 0) == S_OK);
        }
        FreeLibrary(hDll); // Unload the module
    }
    return ok;
}

// Reads the whole file into a string (empty if the file cannot be opened)
string getStringFromFile(const string& file_name)
{
    ifstream file(file_name); // Creates the file stream
    return { istreambuf_iterator<char>(file), istreambuf_iterator<char>{} };
}

// Returns the unique href values found in the downloaded file
set<string> extractLinks(const string& file_name)
{
    static const regex href_regex( "<a href=\"(.*?)\"", regex_constants::icase); // Creates the regex that parses <a> tags

    const string text = getStringFromFile(file_name); // Gets stored string

    // Returns the set of matched instances (capture group 1 of every match)
    return { sregex_token_iterator(text.begin(), text.end(), href_regex, 1), sregex_token_iterator{} };
}

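int main()
{
    string url;
    const string file = "sample.txt"; // File for temporary storage of web page
    cout<<"Please enter a url address: ";
    if(!(cin>>url)) // No input (e.g. end of stream)
    {
        cout<<"No url entered"<<endl;
        return 1;
    }

    if(getURLToFile(url, file)) // Try to get the url
    {
        try
        {
            const set<string> links = extractLinks(file);
            cout<<"The following links were found:"<<endl;
            for(const string& ref: links) // Print all the links in the set
            {
                cout<<ref<<endl;
            }
            cout<<"Done!"<<endl;
        }
        catch(const exception& e) // e.g. regex_error on pathological input
        {
            cout<<"Could not parse the page: "<<e.what()<<endl;
            return 1;
        }
    }
    else
    {
        cout<<"Could not fetch the url"<<endl;
        return 1;
    }
    return 0;
}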

........................................................................CODE ENDS HERE................................................................................

Code Explanation:

1. Read the URL from the user
2. Download the page contents to a temporary file
3. Create a string from that file
4. Parse the HTML using regex
5. Print the list of links
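
The first step is also a cheap place to reject obviously malformed input before any download is attempted. The following is a minimal sketch, assuming that only http and https URLs should be accepted. That restriction and the helper name looksLikeHttpUrl are illustrative assumptions and not part of the answer's code.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (an assumption, not from the answer above):
// true when `url` starts with "http://" or "https://" (case-insensitive)
// and has at least one character after the scheme.
bool looksLikeHttpUrl(const std::string& url)
{
    auto startsWith = [&url](const std::string& prefix) {
        // The size check keeps std::equal in bounds and rejects a bare scheme.
        return url.size() > prefix.size() &&
               std::equal(prefix.begin(), prefix.end(), url.begin(),
                          [](char expected, char actual) {
                              return expected == static_cast<char>(
                                  std::tolower(static_cast<unsigned char>(actual)));
                          });
    };
    return startsWith("http://") || startsWith("https://");
}
```

A caller could test the entered string with looksLikeHttpUrl(url) and print an error instead of calling getURLToFile when it returns false. This only checks the prefix, so the download can still fail and its return value still has to be checked.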

Regex explanation:

REGEX: <a href=\"(.*?)\"

The pattern matches the literal text <a href=" followed by the link and a closing ". Inside the C++ string literal each " must be escaped as \", which is why \" appears at both ends. The parentheses () form a capture group, and the contents of that group are what the regex returns. The dot (.) matches any character except a line break, and * means zero or more repetitions, so on its own .* is greedy and would run on to the last " it can reach. The ? after it makes the match lazy, so it stops at the first " instead. Because the pattern is so literal, it only finds anchors where href is the first attribute, is separated from <a by exactly one space, and uses double quotes.
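
Real pages often put other attributes before href or use single quotes. As an illustration only, here is a sketch of a more tolerant pattern that works on an in-memory string. The function name extractLinksFromText and the pattern itself are assumptions and not part of the answer's code. A regex is still not a full HTML parser, so unquoted attribute values and anchors inside comments or scripts are not handled correctly.

```cpp
#include <regex>
#include <set>
#include <string>

// Hypothetical helper (an assumption, not from the answer above):
// collects the href values of <a> tags found in an HTML string.
std::set<std::string> extractLinksFromText(const std::string& html)
{
    // <a\s+            tag name followed by whitespace (so <abbr> is skipped)
    // (?:[^>]*?\s)?    optional other attributes, ending in whitespace
    // href\s*=\s*      the attribute name, tolerating spaces around '='
    // (["'])(.*?)\1    group 1 is the opening quote, group 2 the value,
    //                  and \1 requires the same quote character to close it
    static const std::regex href_regex(
        R"re(<a\s+(?:[^>]*?\s)?href\s*=\s*(["'])(.*?)\1)re",
        std::regex_constants::icase);

    // Iterate over capture group 2 (the link itself) of every match.
    std::sregex_token_iterator first(html.begin(), html.end(), href_regex, 2);
    std::sregex_token_iterator last;
    return std::set<std::string>(first, last);
}
```

Keeping the parsing step separate from file and network access also makes it easy to check against small hand-written HTML snippets.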

Sample output:

Please enter a url address: https://web.whatsapp.com
The following links were found:
http://www.firefox.com
http://www.google.com/chrome/
http://www.opera.com
https://support.apple.com/downloads/#safari
https://www.microsoft.com/en-us/windows/microsoft-edge
Done!

If you have any doubts, feel free to comment and I'll be pleased to help!
