Question

Write a batch script, which combines a few tools in Linux to finish a big-data processing...

Write a batch script, which combines a few tools in Linux to finish a big-data processing task --- finding out most frequently used words on Wikipedia pages. The execution of the script generates a list of distinct words used in the wikipedia pages and the number of occurrences of each word on these web pages. The words are sorted by the number of occurrences in ascending order. The following is a sample of output generated for 4 Wikipedia pages. 126 that 128 by 133 as 149 or 160 for 164 is 189 on 191 from 345 to 375 advertising 443 a 473 and 480 in 677 of 1080 the Since there are a huge number of pages in Wikipedia, it is not realistic to analyze all of them in short time on one machine. In the project, you need to analyze all the pages for the Wikipedia entries with two capital letters. For example, the Wikipedia page for entry "AC" is https://en.wikipedia.org/wiki/AC . You can use the following command to download and save the page in AC.html: wget https://en.wikipedia.org/wiki/AC -O AC.html A HTML page has HTML tags, which should be removed before the analysis. (Open a .html file using vi and a web browser, and you will find the differences.) You can use lynx to extract the text content into a text file. For example, the following command extract the content for entry "AC" into AC.txt lynx -dump –nolist AC.html > AC.txt After the contents for all the required entries have been extracted, you need to find all the words using grep. You need to use a regular expression to guide grep to do the search. All the words found by grep should be saved into the same file, which is then used to find the most frequently used words. Note that you need to find distinct words and count the number of times that each distinct word appears in file. You may need sort, cut and uniq in this step. Read the man pages of sort, cut and uniq to understand how this can be achieved.

0 0
Add a comment Improve this question Transcribed image text
Answer #1

Please find the following steps to find the unique word count from AC.txt.

Step 1: Download the file from web.

wget https://en.wikipedia.org/wiki/AC -O AC.html

Step 2: Extract the content to AC.txt other than the html tags with the following commands.

lynx -dump –nolist AC.html > AC.txt

Step 3: Following Unix command can help us in find the unique words and its counts from the wiki AC page.

tr ' ' '\n' <AC.txt|sort |uniq -c|sort -bn

Sample Output:

3 is
3 ligament
3 link
3 political
3 series
3 service
3 term
3 this
3 used
3 uses
3 you
4 current
4 links
4 may
4 name
4 type
4 Wikipedia
4 with
5 pages
6 by
7 page
8 an
8 Other
9 or
12 +
13 for
16 to
23 and
35 in
35 of
41 the
53 a
173 *
1799

Add a comment
Know the answer?
Add Answer to:
Write a batch script, which combines a few tools in Linux to finish a big-data processing...
Your Answer:

Post as a guest

Your Name:

What's your source?

Earn Coins

Coins can be redeemed for fabulous gifts.

Not the answer you're looking for? Ask your own homework help question. Our experts will answer your question WITHIN MINUTES for Free.
Similar Homework Help Questions
  • The URL is part of the assignment, it is a web scraper in Wiki. A' Read...

    The URL is part of the assignment, it is a web scraper in Wiki. A' Read aloud A Big-Data Processing Task: We need to find out 15 most frequently used words on a set of Wikipedia pages. Specifically, we need to find out a list of these words and the number of occurrences of each word on the pages. The list should be sorted in descending order based on the number of occurrences. The following is a sample of output...

  • The first script is validate.sh. This is a simple form validation script that will be used...

    The first script is validate.sh. This is a simple form validation script that will be used to verify the inputs given. Normally, this would be done to validate input from a website or another program before entry into a database or other record storage. In this case, we will keep it simple and only focus on the input validation step. In particular, the script should prompt the user for four values: first name, last name, zip code, and email address....

  • I will like to compare automobile producers. This assignment suppose to read data like div tags...

    I will like to compare automobile producers. This assignment suppose to read data like div tags a etc. And count occurrence of them. Reading from a URL while working with an API (using Mediawiki API as an example) Input: Will be obtained from a URL using Mediawiki API -- starter code below Output: Up to you... sort of. What to submit: Upload a report (.pdf preferred) containing screenshots of code, output, and discussion/conclusions to d2l dropbox. Please also submit your...

  • Writing Unix Utilities in C (not C++ or C#) my-cat The program my-cat is a simple...

    Writing Unix Utilities in C (not C++ or C#) my-cat The program my-cat is a simple program. Generally, it reads a file as specified by the user and prints its contents. A typical usage is as follows, in which the user wants to see the contents of my-cat.c, and thus types: prompt> ./my-cat my-cat.c #include <stdio.h> ... As shown, my-cat reads the file my-cat.c and prints out its contents. The "./" before the my-cat above is a UNIX thing; it...

  • 1 Overview and Background Many of the assignments in this course will introduce you to topics in ...

    1 Overview and Background Many of the assignments in this course will introduce you to topics in computational biology. You do not need to know anything about biology to do these assignments other than what is contained in the description itself. The objective of each assignment is for you to acquire certain particular skills or knowledge, and the choice of topic is independent of that objective. Sometimes the topics will be related to computational problems in biology, chemistry, or physics,...

  • the is my HTML and CSS code. I didn't attach the image or the CSS file...

    the is my HTML and CSS code. I didn't attach the image or the CSS file because of the space limit. two things I wanted to incorporate on this page are to make "MY TOP 3 MUSIC GENRES" blink using javascript and to make the video links autoplay using javascript both should be using javascript requirement: our professor told us not to use <div> or alert. thank you for your help <!DOCTYPE html> <!-- I used w3school for the formatting...

  • Don't attempt if you can't attempt fully, i will dislike a nd negative comments would be...

    Don't attempt if you can't attempt fully, i will dislike a nd negative comments would be given Please it's a request. c++ We will read a CSV files of a data dump from the GoodReads 2 web site that contains information about user-rated books (e.g., book tit le, publication year, ISBN number, average reader rating, and cover image URL). The information will be stored and some simple statistics will be calculated. Additionally, for extra credit, the program will create an...

  • Don't attempt if you can't attempt fully, i will dislike and negative comments would be given...

    Don't attempt if you can't attempt fully, i will dislike and negative comments would be given Please it's a request. c++ We will read a CSV files of a data dump from the GoodReads 2 web site that contains information about user-rated books (e.g., book titnle, publication year, ISBN number, average reader rating, and cover image URL). The information will be stored and some simple statistics will be calculated. Additionally, for extra credit, the program will create an HTML web...

  • CMPS 12B Introduction to Data Structures Programming Assignment 2 In this project, you will write...

    can i get some help with this program CMPS 12B Introduction to Data Structures Programming Assignment 2 In this project, you will write a Java program that uses recursion to find all solutions to the n-Queens problem, for 1 Sns 15. (Students who took CMPS 12A from me worked on an iterative, non-recursive approach to this same problem. You can see it at https://classes.soe.ucsc.edu/cmps012a/Spring l8/pa5.pdf.) Begin by reading the Wikipcdia article on the Eight Queens puzzle at: http://en.wikipedia.org/wiki/Eight queens_puzzle In...

ADVERTISEMENT
Free Homework Help App
Download From Google Play
Scan Your Homework
to Get Instant Free Answers
Need Online Homework Help?
Ask a Question
Get Answers For Free
Most questions answered within 3 hours.
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT