Question

Lab Exercise #15 Assignment Overview This lab exercise provides practice with Pandas data analysis library. Data...

Lab Exercise #15

Assignment Overview

This lab exercise provides practice with the Pandas data analysis library.

Data Files

We provide three comma-separated-value files: scores.csv, college_scorecard.csv, and mpg.csv. The first file is a list of a few students and their exam grades. The second file includes data from 1996 through 2016 for all undergraduate degree-granting institutions of higher education. The data about each institution, such as student completion, debt and repayment, earnings, and more, helps students decide where to pursue their higher education. The third file includes information about the fuel economy of a list of vehicles.

In Part A of the lab, we want to compare the execution time required to read the file and to calculate the mean for a specified column. It is better to practice with scores.csv, which has only a few lines, and then switch to college_scorecard.csv to calculate the execution times. In Part B, we will use pandas exclusively with mpg.csv.

Pandas

In this lab, we use a few of the tools provided by the Pandas library. To read a CSV file, Pandas has a function called read_csv. It has many parameters, but in this lab we use only the filename. The function reads the CSV file into a DataFrame object. A DataFrame is a two-dimensional tabular data structure. Here is an example using the scores.csv file:

    data_frame = pandas.read_csv("scores.csv")
    print(data_frame)

         First       Last  Exam1  Exam2  Exam3  Exam4
    0    Grace     Hopper    100     98     87     97
    1   Donald      Knuth     82     87     92     81
    2    Adele   Goldberg     94     96     90     91
    3    Brian  Kernighan     89     74     89     77
    4  Barbara     Liskov     87     97     81     85

You can get the column titles with data_frame.columns:

    Index(['First', 'Last', 'Exam1', 'Exam2', 'Exam3', 'Exam4'], dtype='object')

You can also find the size of the table with data_frame.shape:

    (5, 6)

Subsetting is the process of retrieving parts of a data frame.
A data frame can be subset in a variety of ways, the most common of which involve selecting a range of rows and columns, or selecting columns by column label.

To select a column of a data frame, use the command data_frame["column"], where data_frame is the DataFrame object and column is the name of the selected column. The column name is enclosed within single or double quotes. Alternatively, data_frame.column_title can be used as long as the column name is a valid Python identifier and is not the same as an existing method or attribute. The data type of the output is a Series, which can be thought of as a list with labels. For example, data_frame.Exam1 returns only the exam 1 scores:

    0    100
    1     82
    2     94
    3     89
    4     87
    Name: Exam1, dtype: int64

If you have the column index, you can use it to extract the column title, then use the title to index data_frame and access only that column:

    data_frame[data_frame.columns[0]]

    0      Grace
    1     Donald
    2      Adele
    3      Brian
    4    Barbara
    Name: First, dtype: object

When you create a DataFrame, you have the option to pass the ‘index’ argument to make sure that you have the index you desire. You can also reshape the DataFrame after creation so that it has the index you desire, using the method set_index(). For example,

    data_frame = pandas.read_csv("scores.csv")
    data_frame.set_index('Last', inplace=True)
    print(data_frame)

                 First  Exam1  Exam2  Exam3  Exam4
    Last
    Hopper       Grace    100     98     87     97
    Knuth       Donald     82     87     92     81
    Goldberg     Adele     94     96     90     91
    Kernighan    Brian     89     74     89     77
    Liskov     Barbara     87     97     81     85

inplace=True means the DataFrame is modified directly instead of a modified copy being returned. Note the difference between the data frame printed here and the one printed earlier.

You can find the commonly used methods below.
These methods can be applied to a single column by calling them on that column:

    DataFrame method    Description of output
    describe()          Summary statistics for numerical columns
    head(), tail()      First/last 5 rows in the DataFrame
    min(), max()        Minimum/maximum of values in a numerical column
    mean(), median()    Mean/median of values in a numerical column
    std()               Standard deviation of values in a numerical column
    set_index()         Set the DataFrame index using existing columns

For example,

    data_frame = pandas.read_csv("scores.csv")
    print("Median: ", data_frame['Exam1'].median())
    print("Mean: ", data_frame['Exam1'].mean())

    Median:  89.0
    Mean:  90.4

Import Pandas and try the code above to experiment with reading and processing a CSV file.

Part A – READING A CSV FILE AND PERFORMING SIMPLE OPERATIONS

To find out how long your Python program takes to execute some task, you can use the time module. time.time() returns the number of seconds since January 1, 1970 (the epoch). If you call it before and after a function call and subtract the values, you can find out how long the system took to execute your function.

Download the starter code ‘lab15a.py’. Write a program that reads the file ‘college_scorecard.csv’. In this lab, you should implement two functions for reading and loading the CSV into a data structure, and two functions for finding the median of column index 1 (the "OPEID" column). The goal is to compare their execution times. You should implement the following functions and report how long they take to execute:

1. read_csv_1(filename): This function uses csv.reader to read the file, saves the file data in a list of lists, and returns it.
2. read_csv_2(filename): This function uses pandas.read_csv to read the CSV file and returns a DataFrame.
3. find_median_1(data, index): This function receives a list of lists as input and calculates and returns the median of the column at position index, rounded to 2 decimals. To find the median, first sort the values in ascending order. The median is then the value in the middle of the sorted data. If there is an even number of items, take the average of the two values that “surround” the middle.
4. find_median_2(data_frame, column_name): This function receives a DataFrame as input and calculates and returns the median of the column column_name using the Pandas library, rounded to 2 decimals. Use the .median() method.

Demonstrate your completed program to your TA. On-line students should submit the completed file (named “lab15a.py”) for grading via Mimir. They should also submit a text file describing their observations about how long the functions take.

Part B – EXTRACTING SUBSET OF DATA

Download the starter code ‘lab15b.py’. Write a program that uses the pandas read_csv() function to read the contents of the file ‘mpg.csv’ into a DataFrame. Using DataFrame methods, output:

• The subset of the first 5 rows of the columns titled mpg and horsepower
• The subset of the last 5 rows of the columns titled mpg, horsepower, model_year, and name
• The median of the "acceleration" column
• The US cars with the highest mpg (best fuel economy). The origin of each vehicle model is given in the "origin" column. The model of each car is given in the “name” column.

In this lab, you should define the following functions. You should also create a function that finds the best fuel economy for a given origin and prints the name of the vehicle that has that fuel economy.

1. read_csv_2(filename): This function uses pandas.read_csv to read the CSV file and returns a DataFrame.
2. find_median_2(data_frame, column_name) --> median: This function receives a DataFrame as input and calculates and returns a float, which is the median of the column column_name computed with the Pandas library and rounded to 2 decimals.
3. find_highest_mpg(data_frame, country) --> (car_name, high_mpg): This function receives a DataFrame and a country as input and uses the Pandas library to return the model (str) of that country's car with the highest mpg, together with that highest mpg (float).

To select more than one column, or to make the output a data frame rather than a Series, double brackets should be used. For example, data_frame[["column1", "column2", ...]] returns a data frame with column1, column2, ... included. For example,

    data_frame = pandas.read_csv("scores.csv")
    print(data_frame[['Exam1', 'Exam2']])

       Exam1  Exam2
    0    100     98
    1     82     87
    2     94     96
    3     89     74
    4     87     97

Similar to lists, to select rows by position, use the command data_frame[a:b], where a and b - 1 are the initial and final rows included in the output. loc[] is used to select a range of rows and/or a subset of columns. For example, the following lines return a data frame containing the first 4 rows and the columns 'Exam1' and 'Exam2' of the data set, as shown below. Note that 0:3 refers to labels in the index column, so both endpoints are included.

    data_frame = pandas.read_csv("scores.csv")
    print(data_frame.loc[0:3,['Exam1','Exam2']])

       Exam1  Exam2
    0    100     98
    1     82     87
    2     94     96
    3     89     74

To extract a subset of a DataFrame based on the values in a column, you can define your row indices with a Boolean condition. For example,

    print(data_frame.loc[data_frame["mpg"] == 18,["mpg","name"]])

          mpg  name
    0    18.0  chevrolet chevelle malibu
    2    18.0  plymouth satellite
    16   18.0  amc hornet
    37   18.0  amc matador
    45   18.0  amc hornet sportabout (sw)
    48   18.0  ford mustang
    76   18.0  volvo 145e (sw)
    97   18.0  plymouth valiant
    99   18.0  amc hornet
    100  18.0  ford maverick
    107  18.0  amc gremlin
    111  18.0  maxda rx3
    135  18.0  plymouth satellite sebring
    153  18.0  chevrolet nova
    163  18.0  plymouth fury
    174  18.0  ford pinto

Demonstrate your completed program to your TA. On-line students should submit the completed file (named “lab15b.py”) for grading via Mimir.
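
The subsetting rules described in the exercise can be checked on a small in-memory table. The sketch below rebuilds part of the scores.csv sample by hand (an assumption made so that no file is needed) and contrasts position-based slicing, which excludes the end row, with label-based loc, which includes it. The variable names are illustrative only.

```python
import pandas as pd

# Rebuilt by hand from the scores.csv sample in the exercise text.
scores = pd.DataFrame({
    "First": ["Grace", "Donald", "Adele", "Brian", "Barbara"],
    "Last": ["Hopper", "Knuth", "Goldberg", "Kernighan", "Liskov"],
    "Exam1": [100, 82, 94, 89, 87],
    "Exam2": [98, 87, 96, 74, 97],
})

single = scores["Exam1"]                        # one label -> Series
double = scores[["Exam1", "Exam2"]]             # list of labels -> DataFrame
by_position = scores[0:3]                       # rows 0, 1, 2 (end excluded)
by_label = scores.loc[0:3, ["Exam1", "Exam2"]]  # labels 0..3 (end included)

# Boolean mask: keep only the rows where Exam1 is above 88.
high = scores.loc[scores["Exam1"] > 88, ["Last", "Exam1"]]

print(type(single).__name__, type(double).__name__)
print(len(by_position), len(by_label))
print(high)
```

The same Boolean-mask pattern is what the mpg example in the exercise uses, with `data_frame["mpg"] == 18` as the condition.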

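
The before-and-after timing idea from Part A can be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical name, not part of the starter code, and it swaps in `time.perf_counter()` for the `time.time()` call named in the exercise because `perf_counter` is intended for measuring short intervals. The built-in `sum` stands in for the lab's reading functions.

```python
import time


def timed(func, *args):
    """Hypothetical helper: call func(*args), return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Stand-in workload; in the lab this would be e.g. timed(read_csv_1, filename).
total, elapsed = timed(sum, range(1_000_000))
print(f"sum took {elapsed:.4f} seconds")
```

Elapsed times vary from run to run, so it is reasonable to repeat each measurement a few times before comparing the two approaches.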
Answer #1

# Program for the case described above (Part A: timing comparison).
# The code follows.

import csv
import time

import pandas as pd


def find_mean_1(data, index):
    """Mean of column `index` in a list of rows, skipping the header and 'NULL' cells."""
    total = 0.0
    count = 0
    for row in data[1:]:  # data[0] is the header row
        if row[index] == 'NULL':
            continue
        total += float(row[index])
        count += 1
    # Divide by the number of non-NULL values, not by the number of rows.
    return total / count if count else float('nan')


def read_csv_1(filename):
    """Read the file with csv.reader and return it as a list of lists."""
    with open(filename, 'r', newline='') as f:
        return list(csv.reader(f))


def find_mean_2(dataframe, index):
    # Mean using the DataFrame; pandas skips missing values by default.
    return dataframe[dataframe.columns[index]].mean()


def read_csv_2(filename):
    """Read the file with pandas and return a DataFrame."""
    return pd.read_csv(filename)


def main():
    filename = 'college_scorecard.csv'
    index = 20  # any numeric column index could be chosen

    start_time_1 = time.time()
    find_mean_1(read_csv_1(filename), index)
    end_time_1 = time.time()
    print('Time taken by method 1:', end_time_1 - start_time_1, 'seconds')

    start_time_2 = time.time()
    find_mean_2(read_csv_2(filename), index)
    end_time_2 = time.time()
    print('Time taken by method 2:', end_time_2 - start_time_2, 'seconds')


if __name__ == '__main__':
    main()

# End of the program
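
The exercise statement asks for the median, while the program above times the mean. A minimal sketch of the two median functions follows, assuming the list-of-lists layout has the header in row 0 and writes missing cells as 'NULL' (the same assumption the mean code makes). The small table is rebuilt from the Exam1 column of the scores.csv sample so that no file is needed.

```python
import pandas as pd


def find_median_1(data, index):
    """Median of column `index` in a list of rows; row 0 is the header."""
    values = sorted(float(row[index]) for row in data[1:] if row[index] != 'NULL')
    if not values:
        return float('nan')
    mid = len(values) // 2
    if len(values) % 2:  # odd count: the middle value
        return round(values[mid], 2)
    # even count: average of the two values around the middle
    return round((values[mid - 1] + values[mid]) / 2, 2)


def find_median_2(data_frame, column_name):
    """Median of one DataFrame column, rounded to 2 decimals."""
    return round(float(data_frame[column_name].median()), 2)


rows = [["First", "Exam1"], ["Grace", "100"], ["Donald", "82"],
        ["Adele", "94"], ["Brian", "89"], ["Barbara", "87"]]
frame = pd.DataFrame({"Exam1": [100, 82, 94, 89, 87]})

print(find_median_1(rows, 1), find_median_2(frame, "Exam1"))
```

Both calls agree with the "Median" value shown for Exam1 in the exercise text. On the real file, the functions would be called on the results of read_csv_1 and read_csv_2 inside the timed sections.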

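
For Part B, one possible shape for find_highest_mpg is sketched below. It assumes the column names mpg, origin, and name from the exercise text; the rows and the origin values ("usa", "japan") are made up for illustration and are not taken from mpg.csv.

```python
import pandas as pd


def find_highest_mpg(data_frame, country):
    """Return (car_name, high_mpg) for the best-mpg car whose origin is `country`."""
    subset = data_frame.loc[data_frame["origin"] == country]
    best = subset.loc[subset["mpg"].idxmax()]  # row with the largest mpg
    return best["name"], float(best["mpg"])


# Made-up rows in the mpg.csv column layout, for illustration only.
cars = pd.DataFrame({
    "mpg": [18.0, 31.5, 27.0, 39.0],
    "origin": ["usa", "usa", "japan", "usa"],
    "name": ["car a", "car b", "car c", "car d"],
})

print(find_highest_mpg(cars, "usa"))
```

Note that idxmax returns only the first row when several cars tie for the highest mpg, and it raises an error if no row matches the requested country, so a complete solution may want to handle both cases.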