Question

There are three documents and you need to return the top 3 important words in each...

There are three documents and you need to return the top 3 important words in each document with their TF-IDF scores. Ignore all punctuations. Documents document1.dat, document2.dat, document3.dat

in python

i have to be able to run a program in python the opens 3 documents and outputs the 3 words that are used the most among the documents along with calculate their tf-idf scores

0 0
Add a comment Improve this question Transcribed image text
Answer #1
from collections import Counter
from heapq import nlargest
import math
#THIS FUNCTION RETURNS A DICTIONARY OF EACH FREQUENCY OF WORD
def wordfrequency(fname):
    l = [i.strip().split() for i in open(fname).readlines()]
    d=dict()
    for i in l:
        for j in i:
            if j not in d:
                d[j]=1
            else:
                d[j]+=1
    return d
#THIS FUNCTION RETURNS A LIST CONTAINING THE 3 LARGEST OCCURING WORDS
def threelargest(dct):
    res = nlargest(3, dct, key = dct.get)
    return res
#THIS FUNCTION RETURNS A TF-IDF OF A WORD
def calculate_tf_idf(l,dct,total_no_of_words,df):

    for i in dct:
        total_no_of_words+=dct[i]
    tf=dct[l]/total_no_of_words
    idf=math.log10(3.0/df)

    return (tf*idf)

d1=wordfrequency("f1.dat")
d2=wordfrequency("f2.dat")
d3=wordfrequency("f3.dat")
l1=threelargest(d1)
l2=threelargest(d2)
l3=threelargest(d3)
print("3 most occured word in file 1"+str(l1))
print("3 most occured word in file 2"+str(l2))
print("3 most occured word in file 3"+str(l3))
dt1={}
t1=0
for i in d1:
    t1+=d1[i]
for i in l1:
    df=0
    if i in d1:
        df+=1
    if i in d2:
        df+=1
    if i in d3:
        df+=1
    dt1[i]=calculate_tf_idf(i,d1,t1,df)
print("tfidf of 3 most occured words in file 1 "+str(dt1))
dt2={}
t2=0
for i in d2:
    t2+=d2[i]
for i in l2:
    df=0
    if i in d1:
        df+=1
    if i in d2:
        df+=1
    if i in d3:
        df+=1
    dt2[i]=calculate_tf_idf(i,d2,t2,df)
print("tfidf of 3 most occured words in file 2 "+str(dt2))
dt3={}
t3=0
for i in d3:
    t3+=d3[i]
for i in l3:
    df=0
    if i in d1:
        df+=1
    if i in d2:
        df+=1
    if i in d3:
        df+=1
    dt3[i]=calculate_tf_idf(i,d3,t3,df)
print("tfidf of 3 most occured words in file 3 "+str(dt3))
#d1,d2,d3 contains dictionary of words with frequency
#t1,t2,t3 contains total no of words in d1,d2,d3
#l1,l2,l3 has 3 mlargest occuring words in d1,d2,d3
#dt1,dt2,dt3 has tf-idf for l1,l2,,l3
#df contains the no of documents the word occur

There are dat files f1,f2,f3

OUTPUT

C1. C:\WINDOWS\system32\cmd.exe C:\Users\Lenovo\Desktop>python u1.py 3 most occured word in file 1[-0.06, 0.12, -0.99]GIVE IT A LIKE IF YOU FIND IT USEFUL,THANKS!!!

Add a comment
Know the answer?
Add Answer to:
There are three documents and you need to return the top 3 important words in each...
Your Answer:

Post as a guest

Your Name:

What's your source?

Earn Coins

Coins can be redeemed for fabulous gifts.

Not the answer you're looking for? Ask your own homework help question. Our experts will answer your question WITHIN MINUTES for Free.
Similar Homework Help Questions
  • Python Programming (Just need the Code) Index.py #Python 3.0 import re import os import collections import...

    Python Programming (Just need the Code) Index.py #Python 3.0 import re import os import collections import time #import other modules as needed class index:    def __init__(self,path):    def buildIndex(self):        #function to read documents from collection, tokenize and build the index with tokens        # implement additional functionality to support methods 1 - 4        #use unique document integer IDs    def exact_query(self, query_terms, k):    #function for exact top K retrieval (method 1)    #Returns...

  • You need not run Python programs on a computer in solving the following problems. Place your...

    You need not run Python programs on a computer in solving the following problems. Place your answers into separate "text" files using the names indicated on each problem. Please create your text files using the same text editor that you use for your .py files. Answer submitted in another file format such as .doc, .pages, .rtf, or.pdf will lose least one point per problem! [1] 3 points Use file math.txt What is the precise output from the following code? bar...

  • Pseudocode & Python In this portion of the project you will analyze a problem and create...

    Pseudocode & Python In this portion of the project you will analyze a problem and create a Python program to solve it. In recent years, there has been more attention to Water Quality. Specifically the amount of lead that is found in our daily water supply. Lead and copper residue start to show up in our water supply due to aging pipes. This residue can be found in older homes that have not replaced their pipes and in the municipal...

  • Read this article. Then write a 250 word response on two of the programs you like...

    Read this article. Then write a 250 word response on two of the programs you like the most. Open source business intelligence software 1. BIRT BIRT is an open source BI program that CloudTweaks says is often viewed as the industry standard. BIRT boasts “over 12 million downloads and over 2.5 million developers across 157 countries.” Its users include heavyweights such as Cisco, S1, and IBM (which is also a BIRT sponsor). They also have maturity going for them, as...

  • Starbucks after Schultz This activity is important because, as a manager, you must be able to...

    Starbucks after Schultz This activity is important because, as a manager, you must be able to identify your company’s core competency and select an appropriate business-level strategy to optimize its competitive value. The goal of this exercise is to demonstrate your understanding of core competency and business-level strategies by applying these concepts to Starbucks’ recent experience in identifying and regaining its competitive advantage. Read the case below and answer the questions that follow. Case Inspired by Italian coffee bars, Starbucks...

  • CMPS 12B Introduction to Data Structures Programming Assignment 2 In this project, you will write...

    can i get some help with this program CMPS 12B Introduction to Data Structures Programming Assignment 2 In this project, you will write a Java program that uses recursion to find all solutions to the n-Queens problem, for 1 Sns 15. (Students who took CMPS 12A from me worked on an iterative, non-recursive approach to this same problem. You can see it at https://classes.soe.ucsc.edu/cmps012a/Spring l8/pa5.pdf.) Begin by reading the Wikipcdia article on the Eight Queens puzzle at: http://en.wikipedia.org/wiki/Eight queens_puzzle In...

  • Please use own words. Thank you. CASE QUESTIONS AND DISCUSSION > Analyze and discuss the questions...

    Please use own words. Thank you. CASE QUESTIONS AND DISCUSSION > Analyze and discuss the questions listed below in specific detail. A minimum of 4 pages is required; ensure that you answer all questions completely Case Questions Who are the main players (name and position)? What business (es) and industry or industries is the company in? What are the issues and problems facing the company? (Sort them by importance and urgency.) What are the characteristics of the environment in which...

  • What would you recommend regarding the issue solution and give the best solution ? Case study...

    What would you recommend regarding the issue solution and give the best solution ? Case study to answer all these question is as follow : When Robert Foster arrived at Home Improvement Inc. in December 2000, the deck seemed stacked against the new CEO. He had no retailing experience and, in fact, had spent an entire career in industrial, not consumer, business. His previous job was running Standard Electric’s power systems division, whose multimillion-dollar generating plants for industry and governments...

  • Why are networks and industry relationships important to TWC? What other strategies could an eco-tourism business...

    Why are networks and industry relationships important to TWC? What other strategies could an eco-tourism business of this size use to source ideas and incorporate into its new product development strategy? Tasmanian Walking Company: Balancing luxury and adventure in a sustainable experience Gemma Lewis, PhD University of Tasmania, Australia the organic skincare range supplied by LITYA (Li'tya, 2016). Before introducing this new activity, TWC had to adapt certain treatments to ensure they maintained ocus on sustainable resaurce usage. Their outecor...

  • Below is the information: It is important to understand the different leadership styles employed by nursing...

    Below is the information: It is important to understand the different leadership styles employed by nursing leaders in healthcare organizations and to understand their significance on nursing practice and patient outcomes, for better or for worse. Objective: Read the articles from Nursing Standard (PDF) and Bradley University (PDF). In -250 words, formulate an opinion on the following: 1. Reflect on an occasion where you experienced ineffective leadership (doesn't have to be in the hospital). What behaviors did they display? What...

ADVERTISEMENT
Free Homework Help App
Download From Google Play
Scan Your Homework
to Get Instant Free Answers
Need Online Homework Help?
Ask a Question
Get Answers For Free
Most questions answered within 3 hours.
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT