Question

In this assignment, you will explore more on text analysis and an elementary version of sentiment analysis. Sentiment analysis is the process of using a computer program to identify and categorise opinions in a piece of text in order to determine the writer’s attitude towards a particular topic (e.g., news, product, service etc.). The sentiment can be expressed as positive, negative or neutral.

Create a Python file called a5.py that will perform text analysis on some text files. You can assume that the words of the text files are separated by spaces (‘ ’) and newline characters (‘\n’). You can also assume that some punctuation symbols can be present in the files, e.g., ‘.’, ‘,’, ‘?’, ‘!’. You will find some sample text files (covid1.txt, covid2.txt and covid3.txt) and two files containing sequences of ‘positive’ and ‘negative’ words, respectively, available in the assignment folder.

You must implement and test the following functions inside of your a5.py file:

1) load_datafile(filename) – Takes a single string argument, representing a filename. This is the text file that you will analyze. We will call it datafile.

2) load_wordfile(filename) - Takes a single string argument, representing another filename. This file contains a sequence of positive or negative words. You will use these words to analyze the sentiment of the datafile.

These two functions must open the file and parse the text inside. They should initialize all necessary variables (e.g., lists, dictionaries, other variables) required by your algorithm to store data for the remaining problems. These variables should be created globally, outside the functions, so that the other functions you define later can access them. This ensures the text files are parsed once, and the other functions can be executed multiple times without the need to parse the files again. These functions must also remove any information stored from a previously loaded file each time they are called (i.e., you re-initialize these variables every time these functions are called). You must also handle exceptions in case the files are not available.

Now add the following functions to your program (not necessarily in the given order).

3) remove_punc(list) – takes a list of punctuation symbols, removes these symbols from the content of the datafile, and returns the content.

4) basic_analysis( ) – does not take any input parameters and returns the following statistics for the datafile – the total number of characters, the total number of words and the total number of characters ignoring spaces and punctuation symbols that occur in the datafile.

5) remove_keywords(list) – takes a list of keywords (strings), removes these keywords from the content of the datafile, and returns the content. Each keyword must match whole words only, and matching is not case-sensitive.

6) count_unique( ) – does not take any input parameters and returns the total number of unique words present in the datafile.

7) letter_starting_most_unique_words( ) – does not take any input parameters and returns the letter that starts the maximum number of unique words in the datafile.

8) common_words( ) – does not take any input parameters and returns the list of the most common words in the datafile and their frequency. If there is just one word that occurs most often, it should return a list like [word]. If more than one word occurs the maximum number of times, the list must contain all of those words, like [word1, word2, ..].

9) topx_commonwords(x) – takes an integer x as the input parameter and returns a list data structure with the format [[common words, freq], [ ], ..], where at index i, with 0 ≤ i ≤ x-1 (0 means most common, x-1 means least), it stores the list of the i’th most common words that occur in the datafile and their frequency (how many times they occur). If there is just one word that occurs i’th most commonly, it should be stored in a list like [[word], freq]. If there is a tie at some level i, i.e., more than one word occurs the i’th most times, the list must store all of those words, like [[word1, word2, ..], freq].
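The answer below does not reach part 9, so for illustration here is a minimal standalone sketch of topx_commonwords. It takes the word list as an explicit parameter (rather than the global datafile_data used in the answer) so it can be run on its own; the tie-handling at each frequency level follows the format described above.

```python
from collections import Counter

def topx_commonwords(x, words):
    """Return [[words_at_level_i, freq], ...] for the top x frequency levels."""
    counts = Counter(words)
    # Distinct frequencies, highest first; each level may hold several tied words.
    levels = sorted(set(counts.values()), reverse=True)[:x]
    result = []
    for freq in levels:
        tied = sorted(w for w, c in counts.items() if c == freq)
        result.append([tied, freq])
    return result

words = "the cat and the dog and the bird".split()
print(topx_commonwords(2, words))  # [[['the'], 3], [['and'], 2]]
```

Adapting it to the global datafile_data is a one-line change (drop the words parameter and read the global instead).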

10) word_pairs(string) – takes a string as input parameter that represents the first word of a pair of words. This function should return all words that follow the given string argument. If the argument word does not appear in the text at all, or is never followed by another word (i.e., is the last word in the file), this function should return None.
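Part 10 is also not covered by the answer below; a minimal sketch, again taking the word list as an explicit parameter for the sake of a self-contained example:

```python
def word_pairs(first, words):
    """Return every word that immediately follows `first`, or None if there are none."""
    followers = [words[i + 1] for i in range(len(words) - 1) if words[i] == first]
    return followers if followers else None

words = "the cat saw the dog".split()
print(word_pairs("the", words))  # ['cat', 'dog']
print(word_pairs("dog", words))  # None ('dog' is the last word, never followed)
```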

11) freq_keyword(list) - takes a single list as input parameter that contains string values.

The function should operate as follows:

a. If the list is empty or none of the words specified in the list occur in the datafile, the function should return None.

b. Otherwise, the function should return the word from the list that occurs most frequently in the datafile - or any one of the most common, in the case of a tie.
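A standalone sketch of freq_keyword covering both cases (a) and (b); the word list is passed explicitly here so the example runs on its own:

```python
def freq_keyword(keywords, words):
    """Return the keyword occurring most often in `words`, or None if empty/absent."""
    counts = {k: words.count(k) for k in keywords}
    best = max(counts, default=None, key=counts.get)
    if best is None or counts[best] == 0:
        return None  # empty list, or none of the keywords occur at all
    return best

words = "good good bad ugly".split()
print(freq_keyword(["good", "bad"], words))  # 'good'
print(freq_keyword(["missing"], words))      # None
print(freq_keyword([], words))               # None
```

In a tie, max returns whichever tied keyword it encounters first, which satisfies the "any one of the most common" clause.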

12) sentiment_analysis() – does not take any input parameters and analyzes the sentiment of the datafile as follows. It removes punctuation marks and keywords (e.g., common articles, prepositions, conjunctions etc.) from the datafile and then counts the total number of ‘positive’ words (pos_count) and the total number of ‘negative’ words (neg_count) that occur in the datafile. If pos_count is greater than neg_count it returns ‘Positive sentiment’; if neg_count is greater it returns ‘Negative sentiment’; in case of a tie it returns ‘Neutral sentiment’.
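The counting-and-comparison core of part 12 can be sketched standalone. Here the word list and the positive/negative word sets are passed as parameters (in the real a5.py they would come from the globals filled in by load_datafile and load_wordfile, after remove_punc and remove_keywords have run):

```python
def sentiment_analysis(words, positive, negative):
    """Compare counts of positive vs. negative words in `words`."""
    pos_count = sum(1 for w in words if w in positive)
    neg_count = sum(1 for w in words if w in negative)
    if pos_count > neg_count:
        return 'Positive sentiment'
    if neg_count > pos_count:
        return 'Negative sentiment'
    return 'Neutral sentiment'

words = "vaccine results are promising and encouraging".split()
print(sentiment_analysis(words, {"promising", "encouraging"}, {"deadly"}))
# Positive sentiment
```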

Bonus: Sentiment analysis is not just comparing total number of positive words with that of negative words. Among many other factors, sentiment analysis is highly context-sensitive. For example, the word ‘promising’ can be considered as a positive word. But if the same word is preceded by a negative word like “not promising” then it represents a negative sentiment. Propose one or more rules that can be implemented for sentiment analysis by using any combination of other functions defined in this assignment. You can also propose better rules to count the total number of positive words and negative words that occur in the datafile. As always, you are encouraged to write helper functions.
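One possible rule for the bonus is exactly the "not promising" example: flip a sentiment word's polarity when the word immediately before it is a negator. A standalone sketch (the negator set and the sample word sets are illustrative assumptions, not part of the assignment files):

```python
NEGATORS = {"not", "no", "never"}

def sentiment_with_negation(words, positive, negative):
    """Count sentiment words, flipping polarity when preceded by a negator."""
    pos_count = neg_count = 0
    for i, w in enumerate(words):
        negated = i > 0 and words[i - 1] in NEGATORS
        if w in positive:
            if negated:
                neg_count += 1  # e.g. "not promising" counts as negative
            else:
                pos_count += 1
        elif w in negative:
            if negated:
                pos_count += 1  # e.g. "not bad" counts as positive
            else:
                neg_count += 1
    if pos_count > neg_count:
        return 'Positive sentiment'
    if neg_count > pos_count:
        return 'Negative sentiment'
    return 'Neutral sentiment'

words = "the outlook is not promising".split()
print(sentiment_with_negation(words, {"promising"}, set()))
# Negative sentiment
```

Further rules along the same lines could weight words by frequency (via freq_keyword) or inspect following words with word_pairs.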

You can use analysis_tester.py and the files available in the folder to test your functions.

THE LANGUAGE IS PYTHON

Answer #1


"""
Answering parts 1-6
"""
import sys

datafile = 'covid1.txt'
datafile_data = []  # list of all words in the datafile
datafile_temp = []  # stores the words with punctuation still attached

punc_list = ['.', ',', '?', '!']
key_list = ['IS', 'AN', 'the', 'and']
wordfile_data = []
## Since it is ambiguous whether the word file contains positive or negative words,
## we assume there are two files (one of positive words, one of negative words) and
## reload wordfile_data every time load_wordfile is called.

def load_datafile(filename):
    datafile_data.clear()
    datafile_temp.clear()  # discard data from any previously loaded file
    try:
        with open(filename, "r") as file:
            for line in file:  # repeat for each line in the text file
                fields = line.strip('\n').split(' ')
                datafile_data.extend(fields)
        datafile_temp.extend(datafile_data)
    except OSError:
        print("Oops!", sys.exc_info()[0], "occurred.")
  
  
def load_wordfile(filename):
    wordfile_data.clear()  # discard words from any previously loaded file
    try:
        with open(filename, "r") as file:
            for line in file:  # repeat for each line in the text file
                fields = line.strip('\n').split(' ')
                wordfile_data.extend(fields)
    except OSError:
        print("Oops!", sys.exc_info()[0], "occurred.")
      
def remove_punc(symbols):
    # strip the given symbols from every word; the parameter is named `symbols`
    # to avoid shadowing the built-in list type
    temp = [''.join(ch for ch in word if ch not in symbols) for word in datafile_data]
    datafile_data.clear()
    datafile_data.extend(temp)
    return datafile_data

def basic_analysis():
    """
    Returns the following statistics for the datafile: the total number of
    characters, the total number of words, and the total number of characters
    ignoring spaces and punctuation symbols.
    """
    count_chars = sum(len(word) for word in datafile_temp)       # with punctuation
    count_clean_chars = sum(len(word) for word in datafile_data) # punctuation removed
    count_words = len(datafile_data)
    return {"Number of characters": count_chars,
            "Number of words": count_words,
            "Number of characters without punctuation": count_clean_chars}

def remove_keywords(keywords):
    # Removing items from a list while iterating over it skips elements,
    # so build a filtered copy instead.
    keywords = [k.lower() for k in keywords]  # case-insensitive match
    temp = [word for word in datafile_data if word.lower() not in keywords]
    datafile_data.clear()
    datafile_data.extend(temp)
    return datafile_data


def count_unique( ):
    unique_words = set(datafile_data)
    unique_word_count = len(unique_words)
    return {"Total unique words":unique_word_count}


##loading data
load_datafile(datafile)
#load_wordfile('Positive.txt')
remove_punc(punc_list)
basic_an=basic_analysis()
print(basic_an)
datafile_data=remove_keywords(key_list)
print(count_unique())

#parts 7-8

#does not take any input parameters and returns the letter that starts maximum number of unique words in the datafile.

def letter_starting_most_unique_words( ):
    unique_words = [w for w in set(datafile_data) if w]  # drop empty strings
    letter_freq = {}
    for w in unique_words:
        letter_freq[w[0]] = letter_freq.get(w[0], 0) + 1
    max_freq = max(letter_freq.values())
    # report every letter tied at the maximum, as (letter, count) pairs
    winners = [(letter, freq) for letter, freq in letter_freq.items() if freq == max_freq]
    return {"Letter starting most unique words": winners}

def common_words( ):
    word_freq = {}
    for word in datafile_data:
        word_freq[word] = word_freq.get(word, 0) + 1
    max_freq = max(word_freq.values())
    # report every word tied at the maximum frequency, as (word, count) pairs
    winners = [(word, freq) for word, freq in word_freq.items() if freq == max_freq]
    return {"Most common words": winners}

Sample output:

{'Number of characters': 48, 'Number of words': 12, 'Number of characters without punctuation': 43}
{'Total unique words': 8}
