Assignment 3

Introduction to Pandas and Dataframes

Posted by Cindy Lai on February 5, 2017

Assignment 3

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the “Fresh” row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type food form price_per_lb yield lb_per_cup price_per_cup
fruit watermelon Fresh1 0.333412 0.52 0.330693 0.212033
fruit cantaloupe Fresh1 0.535874 0.51 0.374786 0.3938
vegetables onions Fresh1 1.03811 0.9 0.35274 0.406868
           

It’s okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

Ask questions and search the documentation/web to find the functions you need.

I set the path to the fruit directory and use a list comprehension to get fruit names.

import pandas as pd
import os
from os import listdir
from os.path import isfile, join

fruit_path = "data/fruit/"

num_files = sum(os.path.isfile(os.path.join(fruit_path, f)) for f in os.listdir(fruit_path))

onlyfiles = [f.split('.xlsx', 1)[0] for f in listdir(fruit_path) if isfile(join(fruit_path, f))]

col = ["type", "food", "form", "price_per_lb", "yield_v", "lb_per_cup", "price_per_cup"]

df = pd.DataFrame()

onlyfiles[1]
'apricots'

This is the extract function which will read in a fruit (input will be from onlyfiles), and extract specified values. It returns a dictionary.

def extract(fruit):
    bkb = pd.read_excel(fruit_path + fruit+".xlsx", header= None, skiprows = [0,1,2])
    fresh_row = bkb.iloc[0]
    price_per_lb = fresh_row[1]
    yield_v = fresh_row[3]
    lb_per_cup = fresh_row[4]
    price_per_cup = fresh_row[6]
    lout = ["fruit", fruit, "Fresh1" , price_per_lb, yield_v, lb_per_cup, price_per_cup]

    dout = {"type" : ["fruit"], "food" : [fruit], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}

    return dout

extract(onlyfiles[1])
{'food': ['apricots'],
 'form': ['Fresh1'],
 'lb_per_cup': [0.363762732605048],
 'price_per_cup': [1.1891020280290363],
 'price_per_lb': [3.040071967096438],
 'type': ['fruit'],
 'yield_v': [0.93]}

The getFruit function will get all the fruits in the fruit folder and create a data frame out of all the entries.

def getFruit(fruitList):
    df2 = pd.DataFrame()
    l = []
    for f in fruitList:
        if '$' not in f:
            d = extract(f)
            new_df = pd.DataFrame.from_dict(d)
            l.append(new_df)
    df2 = pd.concat(l, axis = 0)
    return df2

fruit_df = getFruit(onlyfiles)
fruit_df
food form lb_per_cup price_per_cup price_per_lb type yield_v
0 apples Fresh1 0.242508 0.422373 1.567515 fruit 0.90
0 apricots Fresh1 0.363763 1.189102 3.040072 fruit 0.93
0 bananas Fresh1 0.330693 0.292965 0.566983 fruit 0.64
0 berries_mixed Fresh1 0.330693 1.127735 3.410215 fruit 1.00
0 blackberries Fresh1 0.319670 1.922919 5.774708 fruit 0.96
0 blueberries Fresh1 0.319670 1.593177 4.734622 fruit 0.95
0 cantaloupe Fresh1 0.374786 0.393800 0.535874 fruit 0.51
0 cherries Fresh1 0.341717 1.334548 3.592990 fruit 0.92
0 cranberries Fresh1 0.123163 0.589551 4.786741 fruit 1.00
0 dates Fresh1 0.165347 0.792234 4.791351 fruit 1.00
0 figs Fresh1 0.165347 0.990068 5.748318 fruit 0.96
0 fruit_cocktail Fresh1 NaN NaN NaN fruit NaN
0 grapefruit Fresh1 0.462971 0.848278 0.897802 fruit 0.49
0 grapes Fresh1 0.330693 0.721266 2.093827 fruit 0.96
0 honeydew Fresh1 0.374786 0.649077 0.796656 fruit 0.46
0 kiwi Fresh1 0.385809 1.037970 2.044683 fruit 0.76
0 mangoes Fresh1 0.363763 0.705783 1.377563 fruit 0.71
0 nectarines Fresh1 0.319670 0.618667 1.761148 fruit 0.91
0 oranges Fresh1 0.407855 0.578357 1.035173 fruit 0.73
0 papaya Fresh1 0.308647 0.646174 1.298012 fruit 0.62
0 peaches Fresh1 0.341717 0.566390 1.591187 fruit 0.96
0 pears Fresh1 0.363763 0.590740 1.461575 fruit 0.90
0 pineapple Fresh1 0.363763 0.447686 0.627662 fruit 0.51
0 plums Fresh1 0.363763 0.707176 1.827416 fruit 0.94
0 pomegranate Fresh1 0.341717 1.326342 2.173590 fruit 0.56
0 raspberries Fresh1 0.319670 2.322874 6.975811 fruit 0.96
0 strawberries Fresh1 0.319670 0.802171 2.358808 fruit 0.94
0 tangerines Fresh1 0.407855 0.759471 1.377962 fruit 0.74
0 watermelon Fresh1 0.330693 0.212033 0.333412 fruit 0.52

Exercise 1.2. Reuse your code from exercise 1.1 to extract the “Fresh” row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don’t need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don’t have this problem.

Using the same code for fruit, I simply change the path to vegetables.

vegetable_path = "data/vegetables/"

onlyfiles = [f.split('.xlsx', 1)[0] for f in listdir(vegetable_path) if isfile(join(vegetable_path, f))]


def extract(vegetable):
    bkb = pd.read_excel(vegetable_path + vegetable+".xlsx", header= None, skiprows = [0,1,2])
    fresh_row = bkb.iloc[0]
    price_per_lb = fresh_row[1]
    yield_v = fresh_row[3]
    lb_per_cup = fresh_row[4]
    price_per_cup = fresh_row[6]

    dout = {"type" : ["vegetable"], "food" : [vegetable], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}

    '''
    vegetableSeries = pd.Series(data=["vegetable", vegetable, "Fresh1", price_per_lb, yield_v, lb_per_cup, price_per_cup], name = vegetable)
    return vegetableSeries
    '''
    return dout
extract(onlyfiles[1])
{'food': ['artichoke'],
 'form': ['Fresh1'],
 'lb_per_cup': [0.38580895882353577],
 'price_per_cup': [2.2749668026387808],
 'price_per_lb': [2.2130504792860322],
 'type': ['vegetable'],
 'yield_v': [0.37530864197530867]}
def getVegetable(vegetableList):
    df2 = pd.DataFrame()
    l = []
    for f in vegetableList:
        if '$' not in f:
            d = extract(f)
            new_df = pd.DataFrame.from_dict(d)
            l.append(new_df)
    df2 = pd.concat(l, axis = 0)
    return df2

vegetable_df = getVegetable(onlyfiles)
vegetable_df
food form lb_per_cup price_per_cup price_per_lb type yield_v
0 acorn_squash Fresh1 0.451948 1.155360 1.17225 vegetable 0.458554
0 artichoke Fresh1 0.385809 2.274967 2.21305 vegetable 0.375309
0 asparagus Fresh1 0.396832 2.582272 3.21349 vegetable 0.493835
0 avocados Fresh1 0.31967 0.964886 2.23587 vegetable 0.740753
0 beets Fresh1 0.374786 0.586555 1.01728 vegetable 0.65
0 blackeye_peas Fresh1 0.374786 0.524954 0.910441 vegetable 0.65
0 black_beans Fresh1 0.385809 0.582025 0.98058 vegetable 0.65
0 broccoli Fresh1 NaN vegetable
0 brussels_sprouts Fresh1 0.341717 0.890898 2.76355 vegetable 1.06
0 butternut_squash Fresh1 0.451948 0.787893 1.24474 vegetable 0.714
0 cabbage Fresh1 0.330693 0.245944 0.579208 vegetable 0.778797
0 carrots Fresh1 NaN NaN NaN vegetable NaN
0 cauliflower Fresh1 NaN vegetable
0 celery Fresh1 NaN NaN NaN vegetable NaN
0 collard_greens Fresh1 0.286601 0.650001 2.63084 vegetable 1.16
0 corn_sweet Fresh1 0.363763 1.812497 2.69062 vegetable 0.54
0 cucumbers Fresh1 0.264555 0.353448 1.29593 vegetable 0.97
0 great_northern_beans Fresh1 0.385809 0.548392 0.923916 vegetable 0.65
0 green_beans Fresh1 0.275578 0.696606 2.13997 vegetable 0.846575
0 green_peas Fresh1 0.35274 0.549769 1.01307 vegetable 0.65
0 green_peppers Fresh1 0.264555 0.455022 1.41036 vegetable 0.82
0 kale Fresh1 0.286601 0.766262 2.8073 vegetable 1.05
0 kidney_beans Fresh1 0.385809 0.535194 0.90168 vegetable 0.65
0 lentils Fresh1 0.385809 0.196738 1.38504 vegetable 2.7161
0 lettuce_iceberg Fresh1 0.242508 0.309655 1.21304 vegetable 0.95
0 lettuce_romaine Fresh1 NaN vegetable NaN
0 lima_beans Fresh1 0.374786 0.797757 1.38357 vegetable 0.65
0 mixed_vegetables Fresh1 NaN vegetable
0 mushrooms Fresh1 NaN NaN NaN vegetable NaN
0 mustard_greens Fresh1 0.308647 0.944032 2.56924 vegetable 0.84
0 navy_beans Fresh1 0.385809 0.575997 0.970423 vegetable 0.65
0 okra Fresh1 0.35274 1.473146 3.21355 vegetable 0.769474
0 olives Fresh1 0.297624 1.246102 4.18683 vegetable 1
0 onions Fresh1 0.35274 0.406868 1.03811 vegetable 0.9
0 pinto_beans Fresh1 0.385809 0.514129 0.86619 vegetable 0.65
0 potatoes Fresh1 0.264555 0.184017 0.56432 vegetable 0.811301
0 pumpkin Fresh1 0.540133 0.730411 1.35228 vegetable 1
0 radish Fresh1 0.275578 0.401618 1.31163 vegetable 0.9
0 red_peppers Fresh1 0.264555 0.734926 2.27794 vegetable 0.82
0 spinach Fresh1 NaN NaN NaN vegetable NaN
0 summer_squash Fresh1 0.396832 0.845480 1.63948 vegetable 0.7695
0 sweet_potatoes Fresh1 0.440925 0.499400 0.918897 vegetable 0.811301
0 tomatoes Fresh1 NaN NaN NaN vegetable NaN
0 turnip_greens Fresh1 0.31967 1.053526 2.47175 vegetable 0.75

The tomatoes entry does not have the correct information (it displays NaN for all its rows instead). Broccoli, celery, carrots, cauliflower, romaine lettuce, mixed vegetables, mushrooms, and spinach also have NaN values.

This is because these entries have nothing on the row which is labeled “Fresh”. There are different types of those vegetables which are sold at different prices.

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

I now combine the vegetable and fruit files into one data frame. I then clean for invalid entries.

vegetable_path = "data/vegetables/"
fruit_path = "data/fruit/"

num_files = sum(os.path.isfile(os.path.join(vegetable_path, f)) for f in os.listdir(vegetable_path))

vegetableFiles = [f.split('.xlsx', 1)[0] for f in listdir(vegetable_path) if isfile(join(vegetable_path, f))]
fruitFiles  = [f.split('.xlsx', 1)[0] for f in listdir(fruit_path) if isfile(join(fruit_path, f))]

col = ["type", "food", "form", "price_per_lb", "yield_v", "lb_per_cup", "price_per_cup"]

df = pd.DataFrame()

#extract function, works for both fruits and veggies
def extract(typeO, path, obj):
    bkb = pd.read_excel(path + obj+".xlsx", header= None, skiprows = [0,1,2])
    fresh_row = bkb.iloc[0]
    price_per_lb = fresh_row[1]
    yield_v = fresh_row[3]
    lb_per_cup = fresh_row[4]
    price_per_cup = fresh_row[6]
    dout = {"type" : [typeO], "food" : [obj], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}

    return dout
#extract function, works for both fruits and veggies
def extract(typeO, path, obj):
    bkb = pd.read_excel(path + obj+".xlsx", header= None, skiprows = [0,1,2])
    fresh_row = bkb.iloc[0]
    price_per_lb = fresh_row[1]
    yield_v = fresh_row[3]
    lb_per_cup = fresh_row[4]
    price_per_cup = fresh_row[6]
    dout = {"type" : [typeO], "food" : [obj], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}

    return dout
#returns combined dataframe of both fruits and veggies
def getBoth(fruitList, vegetableList):
    df2 = pd.DataFrame()
    l = []
    for f in fruitList:
        if '$' not in f:
            d = extract("fruit", fruit_path, f)
            new_df = pd.DataFrame.from_dict(d)
            l.append(new_df)
    for f in vegetableList:
         if '$' not in f:
            d = extract("vegetables", vegetable_path, f)
            new_df = pd.DataFrame.from_dict(d)
            l.append(new_df)

    df2 = pd.concat(l, axis = 0)
    return df2

merged_df = getBoth(fruitFiles, vegetableFiles)
merged_df[:5]
food form lb_per_cup price_per_cup price_per_lb type yield_v
0 apples Fresh1 0.242508 0.422373 1.56752 fruit 0.9
0 apricots Fresh1 0.363763 1.189102 3.04007 fruit 0.93
0 bananas Fresh1 0.330693 0.292965 0.566983 fruit 0.64
0 berries_mixed Fresh1 0.330693 1.127735 3.41021 fruit 1
0 blackberries Fresh1 0.31967 1.922919 5.77471 fruit 0.96

I now clean the data to remove invalid rows.

def cleanData(df):
    df2 = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
    df2 = df2[pd.notnull(df2['price_per_lb'])]
    df2 = df2[pd.notnull(df2['price_per_cup'])]
    return df2

clean_df = cleanData(merged_df)
clean_df
food form lb_per_cup price_per_cup price_per_lb type yield_v
0 apples Fresh1 0.242508 0.422373 1.56752 fruit 0.9
0 apricots Fresh1 0.363763 1.189102 3.04007 fruit 0.93
0 bananas Fresh1 0.330693 0.292965 0.566983 fruit 0.64
0 berries_mixed Fresh1 0.330693 1.127735 3.41021 fruit 1
0 blackberries Fresh1 0.31967 1.922919 5.77471 fruit 0.96
0 blueberries Fresh1 0.31967 1.593177 4.73462 fruit 0.95
0 cantaloupe Fresh1 0.374786 0.393800 0.535874 fruit 0.51
0 cherries Fresh1 0.341717 1.334548 3.59299 fruit 0.92
0 cranberries Fresh1 0.123163 0.589551 4.78674 fruit 1
0 dates Fresh1 0.165347 0.792234 4.79135 fruit 1
0 figs Fresh1 0.165347 0.990068 5.74832 fruit 0.96
0 grapefruit Fresh1 0.462971 0.848278 0.897802 fruit 0.49
0 grapes Fresh1 0.330693 0.721266 2.09383 fruit 0.96
0 honeydew Fresh1 0.374786 0.649077 0.796656 fruit 0.46
0 kiwi Fresh1 0.385809 1.037970 2.04468 fruit 0.76
0 mangoes Fresh1 0.363763 0.705783 1.37756 fruit 0.71
0 nectarines Fresh1 0.31967 0.618667 1.76115 fruit 0.91
0 oranges Fresh1 0.407855 0.578357 1.03517 fruit 0.73
0 papaya Fresh1 0.308647 0.646174 1.29801 fruit 0.62
0 peaches Fresh1 0.341717 0.566390 1.59119 fruit 0.96
0 pears Fresh1 0.363763 0.590740 1.46157 fruit 0.9
0 pineapple Fresh1 0.363763 0.447686 0.627662 fruit 0.51
0 plums Fresh1 0.363763 0.707176 1.82742 fruit 0.94
0 pomegranate Fresh1 0.341717 1.326342 2.17359 fruit 0.56
0 raspberries Fresh1 0.31967 2.322874 6.97581 fruit 0.96
0 strawberries Fresh1 0.31967 0.802171 2.35881 fruit 0.94
0 tangerines Fresh1 0.407855 0.759471 1.37796 fruit 0.74
0 watermelon Fresh1 0.330693 0.212033 0.333412 fruit 0.52
0 acorn_squash Fresh1 0.451948 1.155360 1.17225 vegetables 0.458554
0 artichoke Fresh1 0.385809 2.274967 2.21305 vegetables 0.375309
... ... ... ... ... ... ... ...
0 blackeye_peas Fresh1 0.374786 0.524954 0.910441 vegetables 0.65
0 black_beans Fresh1 0.385809 0.582025 0.98058 vegetables 0.65
0 brussels_sprouts Fresh1 0.341717 0.890898 2.76355 vegetables 1.06
0 butternut_squash Fresh1 0.451948 0.787893 1.24474 vegetables 0.714
0 cabbage Fresh1 0.330693 0.245944 0.579208 vegetables 0.778797
0 collard_greens Fresh1 0.286601 0.650001 2.63084 vegetables 1.16
0 corn_sweet Fresh1 0.363763 1.812497 2.69062 vegetables 0.54
0 cucumbers Fresh1 0.264555 0.353448 1.29593 vegetables 0.97
0 great_northern_beans Fresh1 0.385809 0.548392 0.923916 vegetables 0.65
0 green_beans Fresh1 0.275578 0.696606 2.13997 vegetables 0.846575
0 green_peas Fresh1 0.35274 0.549769 1.01307 vegetables 0.65
0 green_peppers Fresh1 0.264555 0.455022 1.41036 vegetables 0.82
0 kale Fresh1 0.286601 0.766262 2.8073 vegetables 1.05
0 kidney_beans Fresh1 0.385809 0.535194 0.90168 vegetables 0.65
0 lentils Fresh1 0.385809 0.196738 1.38504 vegetables 2.7161
0 lettuce_iceberg Fresh1 0.242508 0.309655 1.21304 vegetables 0.95
0 lima_beans Fresh1 0.374786 0.797757 1.38357 vegetables 0.65
0 mustard_greens Fresh1 0.308647 0.944032 2.56924 vegetables 0.84
0 navy_beans Fresh1 0.385809 0.575997 0.970423 vegetables 0.65
0 okra Fresh1 0.35274 1.473146 3.21355 vegetables 0.769474
0 olives Fresh1 0.297624 1.246102 4.18683 vegetables 1
0 onions Fresh1 0.35274 0.406868 1.03811 vegetables 0.9
0 pinto_beans Fresh1 0.385809 0.514129 0.86619 vegetables 0.65
0 potatoes Fresh1 0.264555 0.184017 0.56432 vegetables 0.811301
0 pumpkin Fresh1 0.540133 0.730411 1.35228 vegetables 1
0 radish Fresh1 0.275578 0.401618 1.31163 vegetables 0.9
0 red_peppers Fresh1 0.264555 0.734926 2.27794 vegetables 0.82
0 summer_squash Fresh1 0.396832 0.845480 1.63948 vegetables 0.7695
0 sweet_potatoes Fresh1 0.440925 0.499400 0.918897 vegetables 0.811301
0 turnip_greens Fresh1 0.31967 1.053526 2.47175 vegetables 0.75

63 rows × 7 columns

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

  • What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
  • How do the price distributions compare for fruit and vegetables?
  • Which foods are the best value for the price?
  • What’s something surprising about this data set?
  • Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.
  1. Most expensive fruits are the berries (raspberries, blackberries, cranberries, blueberries) and figs/dates. The least expensive are bananas, watermelon, cantalope, and pineapple.
fruit_df.sort_values(['price_per_lb'], ascending = 0)[:9]
food form lb_per_cup price_per_cup price_per_lb type yield_v
0 raspberries Fresh1 0.319670 2.322874 6.975811 fruit 0.96
0 blackberries Fresh1 0.319670 1.922919 5.774708 fruit 0.96
0 figs Fresh1 0.165347 0.990068 5.748318 fruit 0.96
0 dates Fresh1 0.165347 0.792234 4.791351 fruit 1.00
0 cranberries Fresh1 0.123163 0.589551 4.786741 fruit 1.00
0 blueberries Fresh1 0.319670 1.593177 4.734622 fruit 0.95
0 cherries Fresh1 0.341717 1.334548 3.592990 fruit 0.92
0 berries_mixed Fresh1 0.330693 1.127735 3.410215 fruit 1.00
0 apricots Fresh1 0.363763 1.189102 3.040072 fruit 0.93
fruit_df.sort_values(['price_per_lb'], ascending = 1)[:9]
food form lb_per_cup price_per_cup price_per_lb type yield_v
0 watermelon Fresh1 0.330693 0.212033 0.333412 fruit 0.52
0 cantaloupe Fresh1 0.374786 0.393800 0.535874 fruit 0.51
0 bananas Fresh1 0.330693 0.292965 0.566983 fruit 0.64
0 pineapple Fresh1 0.363763 0.447686 0.627662 fruit 0.51
0 honeydew Fresh1 0.374786 0.649077 0.796656 fruit 0.46
0 grapefruit Fresh1 0.462971 0.848278 0.897802 fruit 0.49
0 oranges Fresh1 0.407855 0.578357 1.035173 fruit 0.73
0 papaya Fresh1 0.308647 0.646174 1.298012 fruit 0.62
0 mangoes Fresh1 0.363763 0.705783 1.377563 fruit 0.71
clean_df.boxplot("type", "price_per_lb")
plt.show()