Question

In this assignment, you will explore more on text analysis and an elementary version of sentiment analysis. Sentiment analysis is the process of using a computer program to identify and categorise opinions in a piece of text in order to determine the writer’s attitude towards a particular topic (e.g., news, product, service etc.). The sentiment can be expressed as positive, negative or neutral.

Create a Python file called a5.py that will perform text analysis on some text files. You can assume that the words of the text files are separated by spaces (‘ ’) and newline characters (‘\n’). You can also assume that some punctuation symbols can be present in the files, e.g., ‘.’, ‘,’, ‘?’, ‘!’. You will find some sample text files (covid1.txt, covid2.txt and covid3.txt) and two files containing sequences of ‘positive’ and ‘negative’ words, respectively, available in the assignment folder.

You must implement and test the following functions inside of your a5.py file:

1) load_datafile(filename) – Takes a single string argument, representing a filename. This is the text file that you will analyze. We will call it datafile.

2) load_wordfile(filename) - Takes a single string argument, representing another filename. This file contains a sequence of positive or negative words. You will use these words to analyze the sentiment of the datafile.

These two functions must open the file and parse the text inside. They should initialize all necessary variables (e.g., lists, dictionaries, other variables) required by your algorithm to store data for the remaining problems. These variables should be created globally, outside the functions, so that the other functions you define later can access them. This ensures the text files are parsed once, and the other functions can be executed multiple times without the need to parse the files again. These functions must also remove any information stored from a previously loaded file each time they are called (i.e., you re-initialize these variables every time these functions are called). You must also handle exceptions in case the files are not available.

Now add the following functions to your program (not necessarily in the given order).

3) remove_punc(list) – takes a list of punctuation symbols, removes these symbols from the content of the datafile, and returns the content.

4) basic_analysis( ) – does not take any input parameters and returns the following statistics for the datafile – the total number of characters, the total number of words and the total number of characters ignoring spaces and punctuation symbols that occur in the datafile.

5) remove_keywords(list) – takes a list of keywords (strings), removes these keywords from the content of the datafile, and returns the content. Each keyword must match whole words only, and matching is not case-sensitive.

6) count_unique( ) – does not take any input parameters and returns the total number of unique words present in the datafile.

7) letter_starting_most_unique_words( ) – does not take any input parameters and returns the letter that starts the maximum number of unique words in the datafile.

8) common_words( ) – does not take any input parameters and returns the list of the most common words in the datafile and their frequency. If there is just one word that occurs most often, it should return a list like [word]. If more than one word occurs the maximum number of times, the list must contain all of those words, like [word1, word2, ..].

9) topx_commonwords(x) – takes an integer x as the input parameter and returns a list data structure with the format [[common words, freq], [ ], ..], where at index i, with 0 ≤ i ≤ x-1 (0 means most common, x-1 means least), it stores the list of the i’th most common words that occur in the datafile and their frequency (how many times they occur). If there is just one word that occurs i’th most commonly, it should be stored in a list like [[word], freq]. If there is a tie at some level i, i.e., more than one word occurs the i’th most times, the list must store all of those words, like [[word1, word2, ..], freq].
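The answer below does not reach part 9, so for illustration here is a minimal standalone sketch of topx_commonwords. It takes the word list as an explicit parameter (rather than the global datafile_data used in the answer) so it can be run on its own; the tie-handling at each frequency level follows the format described above.

```python
from collections import Counter

def topx_commonwords(x, words):
    """Return [[words_at_level_i, freq], ...] for the top x frequency levels."""
    counts = Counter(words)
    # Distinct frequencies, highest first; each level may hold several tied words.
    levels = sorted(set(counts.values()), reverse=True)[:x]
    result = []
    for freq in levels:
        tied = sorted(w for w, c in counts.items() if c == freq)
        result.append([tied, freq])
    return result

words = "the cat and the dog and the bird".split()
print(topx_commonwords(2, words))  # [[['the'], 3], [['and'], 2]]
```

Adapting it to the global datafile_data is a one-line change (drop the words parameter and read the global instead).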

10) word_pairs(string) – takes a string as input parameter that represents the first word of a pair of words. This function should return all words that follow the given string argument. If the argument word does not appear in the text at all, or is never followed by another word (i.e., is the last word in the file), this function should return None.
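Part 10 is also not covered by the answer below; a minimal sketch, again taking the word list as an explicit parameter for the sake of a self-contained example:

```python
def word_pairs(first, words):
    """Return every word that immediately follows `first`, or None if there are none."""
    followers = [words[i + 1] for i in range(len(words) - 1) if words[i] == first]
    return followers if followers else None

words = "the cat saw the dog".split()
print(word_pairs("the", words))  # ['cat', 'dog']
print(word_pairs("dog", words))  # None ('dog' is the last word, never followed)
```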

11) freq_keyword(list) - takes a single list as input parameter that contains string values.

The function should operate as follows:

a. If the list is empty or none of the words specified in the list occur in the datafile, the function should return None.

b. Otherwise, the function should return the word from the list that occurs most frequently in the datafile - or any one of the most common, in the case of a tie.
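A standalone sketch of freq_keyword covering both cases (a) and (b); the word list is passed explicitly here so the example runs on its own:

```python
def freq_keyword(keywords, words):
    """Return the keyword occurring most often in `words`, or None if empty/absent."""
    counts = {k: words.count(k) for k in keywords}
    best = max(counts, default=None, key=counts.get)
    if best is None or counts[best] == 0:
        return None  # empty list, or none of the keywords occur at all
    return best

words = "good good bad ugly".split()
print(freq_keyword(["good", "bad"], words))  # 'good'
print(freq_keyword(["missing"], words))      # None
print(freq_keyword([], words))               # None
```

In a tie, max returns whichever tied keyword it encounters first, which satisfies the "any one of the most common" clause.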

12) sentiment_analysis() – does not take any input parameters and analyzes the sentiment of the datafile as follows. It removes punctuation marks and keywords (e.g., common articles, prepositions, conjunctions etc.) from the datafile and then counts the total number of ‘positive’ words (pos_count) and the total number of ‘negative’ words (neg_count) that occur in the datafile. If pos_count is greater than neg_count it returns ‘Positive sentiment’; if neg_count is greater it returns ‘Negative sentiment’; in case of a tie it returns ‘Neutral sentiment’.
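The counting-and-comparison core of part 12 can be sketched standalone. Here the word list and the positive/negative word sets are passed as parameters (in the real a5.py they would come from the globals filled in by load_datafile and load_wordfile, after remove_punc and remove_keywords have run):

```python
def sentiment_analysis(words, positive, negative):
    """Compare counts of positive vs. negative words in `words`."""
    pos_count = sum(1 for w in words if w in positive)
    neg_count = sum(1 for w in words if w in negative)
    if pos_count > neg_count:
        return 'Positive sentiment'
    if neg_count > pos_count:
        return 'Negative sentiment'
    return 'Neutral sentiment'

words = "vaccine results are promising and encouraging".split()
print(sentiment_analysis(words, {"promising", "encouraging"}, {"deadly"}))
# Positive sentiment
```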

Bonus: Sentiment analysis is not just comparing total number of positive words with that of negative words. Among many other factors, sentiment analysis is highly context-sensitive. For example, the word ‘promising’ can be considered as a positive word. But if the same word is preceded by a negative word like “not promising” then it represents a negative sentiment. Propose one or more rules that can be implemented for sentiment analysis by using any combination of other functions defined in this assignment. You can also propose better rules to count the total number of positive words and negative words that occur in the datafile. As always, you are encouraged to write helper functions.
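One possible rule for the bonus is exactly the "not promising" example: flip a sentiment word's polarity when the word immediately before it is a negator. A standalone sketch (the negator set and the sample word sets are illustrative assumptions, not part of the assignment files):

```python
NEGATORS = {"not", "no", "never"}

def sentiment_with_negation(words, positive, negative):
    """Count sentiment words, flipping polarity when preceded by a negator."""
    pos_count = neg_count = 0
    for i, w in enumerate(words):
        negated = i > 0 and words[i - 1] in NEGATORS
        if w in positive:
            if negated:
                neg_count += 1  # e.g. "not promising" counts as negative
            else:
                pos_count += 1
        elif w in negative:
            if negated:
                pos_count += 1  # e.g. "not bad" counts as positive
            else:
                neg_count += 1
    if pos_count > neg_count:
        return 'Positive sentiment'
    if neg_count > pos_count:
        return 'Negative sentiment'
    return 'Neutral sentiment'

words = "the outlook is not promising".split()
print(sentiment_with_negation(words, {"promising"}, set()))
# Negative sentiment
```

Further rules along the same lines could weight words by frequency (via freq_keyword) or inspect following words with word_pairs.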

You can use analysis_tester.py and the files available in the folder to test your functions.

THE LANGUAGE IS PYTHON

Answer #1


"""
Answering parts 1-6
"""
import sys

datafile = 'covid1.txt'
datafile_data = []  # list of all words in the datafile
datafile_temp = []  # stores the words with punctuation still attached

punc_list = ['.', ',', '?', '!']
key_list = ['IS', 'AN', 'the', 'and']
wordfile_data = []
## Since it is ambiguous whether the word file contains positive or negative words,
## we assume there are two files (one of positive words, one of negative words) and
## reload wordfile_data every time load_wordfile is called.

def load_datafile(filename):
    datafile_data.clear()
    datafile_temp.clear()  # discard data from any previously loaded file
    try:
        with open(filename, "r") as file:
            for line in file:  # repeat for each line in the text file
                fields = line.strip('\n').split(' ')
                datafile_data.extend(fields)
        datafile_temp.extend(datafile_data)
    except OSError:
        print("Oops!", sys.exc_info()[0], "occurred.")
  
  
def load_wordfile(filename):
    wordfile_data.clear()  # discard words from any previously loaded file
    try:
        with open(filename, "r") as file:
            for line in file:  # repeat for each line in the text file
                fields = line.strip('\n').split(' ')
                wordfile_data.extend(fields)
    except OSError:
        print("Oops!", sys.exc_info()[0], "occurred.")
      
def remove_punc(symbols):
    # strip the given symbols from every word; the parameter is named `symbols`
    # to avoid shadowing the built-in list type
    temp = [''.join(ch for ch in word if ch not in symbols) for word in datafile_data]
    datafile_data.clear()
    datafile_data.extend(temp)
    return datafile_data

def basic_analysis():
    """
    Returns the following statistics for the datafile: the total number of
    characters, the total number of words, and the total number of characters
    ignoring spaces and punctuation symbols.
    """
    count_chars = sum(len(word) for word in datafile_temp)       # with punctuation
    count_clean_chars = sum(len(word) for word in datafile_data) # punctuation removed
    count_words = len(datafile_data)
    return {"Number of characters": count_chars,
            "Number of words": count_words,
            "Number of characters without punctuation": count_clean_chars}

def remove_keywords(keywords):
    # Removing items from a list while iterating over it skips elements,
    # so build a filtered copy instead.
    keywords = [k.lower() for k in keywords]  # case-insensitive match
    temp = [word for word in datafile_data if word.lower() not in keywords]
    datafile_data.clear()
    datafile_data.extend(temp)
    return datafile_data


def count_unique( ):
    unique_words = set(datafile_data)
    unique_word_count = len(unique_words)
    return {"Total unique words":unique_word_count}


##loading data
load_datafile(datafile)
#load_wordfile('Positive.txt')
remove_punc(punc_list)
basic_an=basic_analysis()
print(basic_an)
datafile_data=remove_keywords(key_list)
print(count_unique())

#parts 7-8

#does not take any input parameters and returns the letter that starts maximum number of unique words in the datafile.

def letter_starting_most_unique_words( ):
    unique_words = [w for w in set(datafile_data) if w]  # drop empty strings
    letter_freq = {}
    for w in unique_words:
        letter_freq[w[0]] = letter_freq.get(w[0], 0) + 1
    max_freq = max(letter_freq.values())
    # report every letter tied at the maximum, as (letter, count) pairs
    winners = [(letter, freq) for letter, freq in letter_freq.items() if freq == max_freq]
    return {"Letter starting most unique words": winners}

def common_words( ):
    word_freq = {}
    for word in datafile_data:
        word_freq[word] = word_freq.get(word, 0) + 1
    max_freq = max(word_freq.values())
    # report every word tied at the maximum frequency, as (word, count) pairs
    winners = [(word, freq) for word, freq in word_freq.items() if freq == max_freq]
    return {"Most common words": winners}

Sample output:

{'Number of characters': 48, 'Number of words': 12, 'Number of characters without punctuation': 43}
{'Total unique words': 8}
