Assignment 3
The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip
file contains the fruit and vegetable files in the directories fruit
and vegetables
, respectively.
Exercise 1.1. Use pandas to extract the “Fresh” row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
---|---|---|---|---|---|---|
fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
… |
It’s okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str
methodsos
os.path
- pandas:
read_excel()
,concat()
,.fillna()
,.str
, plotting methods
Ask questions and search the documentation/web to find the functions you need.
I set the path to the fruit directory and use a list comprehension to get fruit names.
import pandas as pd
import os
from os import listdir
from os.path import isfile, join
fruit_path = "data/fruit/"
num_files = sum(os.path.isfile(os.path.join(fruit_path, f)) for f in os.listdir(fruit_path))
onlyfiles = [f.split('.xlsx', 1)[0] for f in listdir(fruit_path) if isfile(join(fruit_path, f))]
col = ["type", "food", "form", "price_per_lb", "yield_v", "lb_per_cup", "price_per_cup"]
df = pd.DataFrame()
onlyfiles[1]
'apricots'
This is the extract function which will read in a fruit (input will be from onlyfiles), and extract specified values. It returns a dictionary.
def extract(fruit):
bkb = pd.read_excel(fruit_path + fruit+".xlsx", header= None, skiprows = [0,1,2])
fresh_row = bkb.iloc[0]
price_per_lb = fresh_row[1]
yield_v = fresh_row[3]
lb_per_cup = fresh_row[4]
price_per_cup = fresh_row[6]
lout = ["fruit", fruit, "Fresh1" , price_per_lb, yield_v, lb_per_cup, price_per_cup]
dout = {"type" : ["fruit"], "food" : [fruit], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}
return dout
extract(onlyfiles[1])
{'food': ['apricots'],
'form': ['Fresh1'],
'lb_per_cup': [0.363762732605048],
'price_per_cup': [1.1891020280290363],
'price_per_lb': [3.040071967096438],
'type': ['fruit'],
'yield_v': [0.93]}
The getFruit function will get all the fruits in the fruit folder and create a data frame out of all the entries.
def getFruit(fruitList):
df2 = pd.DataFrame()
l = []
for f in fruitList:
if '$' not in f:
d = extract(f)
new_df = pd.DataFrame.from_dict(d)
l.append(new_df)
df2 = pd.concat(l, axis = 0)
return df2
fruit_df = getFruit(onlyfiles)
fruit_df
food | form | lb_per_cup | price_per_cup | price_per_lb | type | yield_v | |
---|---|---|---|---|---|---|---|
0 | apples | Fresh1 | 0.242508 | 0.422373 | 1.567515 | fruit | 0.90 |
0 | apricots | Fresh1 | 0.363763 | 1.189102 | 3.040072 | fruit | 0.93 |
0 | bananas | Fresh1 | 0.330693 | 0.292965 | 0.566983 | fruit | 0.64 |
0 | berries_mixed | Fresh1 | 0.330693 | 1.127735 | 3.410215 | fruit | 1.00 |
0 | blackberries | Fresh1 | 0.319670 | 1.922919 | 5.774708 | fruit | 0.96 |
0 | blueberries | Fresh1 | 0.319670 | 1.593177 | 4.734622 | fruit | 0.95 |
0 | cantaloupe | Fresh1 | 0.374786 | 0.393800 | 0.535874 | fruit | 0.51 |
0 | cherries | Fresh1 | 0.341717 | 1.334548 | 3.592990 | fruit | 0.92 |
0 | cranberries | Fresh1 | 0.123163 | 0.589551 | 4.786741 | fruit | 1.00 |
0 | dates | Fresh1 | 0.165347 | 0.792234 | 4.791351 | fruit | 1.00 |
0 | figs | Fresh1 | 0.165347 | 0.990068 | 5.748318 | fruit | 0.96 |
0 | fruit_cocktail | Fresh1 | NaN | NaN | NaN | fruit | NaN |
0 | grapefruit | Fresh1 | 0.462971 | 0.848278 | 0.897802 | fruit | 0.49 |
0 | grapes | Fresh1 | 0.330693 | 0.721266 | 2.093827 | fruit | 0.96 |
0 | honeydew | Fresh1 | 0.374786 | 0.649077 | 0.796656 | fruit | 0.46 |
0 | kiwi | Fresh1 | 0.385809 | 1.037970 | 2.044683 | fruit | 0.76 |
0 | mangoes | Fresh1 | 0.363763 | 0.705783 | 1.377563 | fruit | 0.71 |
0 | nectarines | Fresh1 | 0.319670 | 0.618667 | 1.761148 | fruit | 0.91 |
0 | oranges | Fresh1 | 0.407855 | 0.578357 | 1.035173 | fruit | 0.73 |
0 | papaya | Fresh1 | 0.308647 | 0.646174 | 1.298012 | fruit | 0.62 |
0 | peaches | Fresh1 | 0.341717 | 0.566390 | 1.591187 | fruit | 0.96 |
0 | pears | Fresh1 | 0.363763 | 0.590740 | 1.461575 | fruit | 0.90 |
0 | pineapple | Fresh1 | 0.363763 | 0.447686 | 0.627662 | fruit | 0.51 |
0 | plums | Fresh1 | 0.363763 | 0.707176 | 1.827416 | fruit | 0.94 |
0 | pomegranate | Fresh1 | 0.341717 | 1.326342 | 2.173590 | fruit | 0.56 |
0 | raspberries | Fresh1 | 0.319670 | 2.322874 | 6.975811 | fruit | 0.96 |
0 | strawberries | Fresh1 | 0.319670 | 0.802171 | 2.358808 | fruit | 0.94 |
0 | tangerines | Fresh1 | 0.407855 | 0.759471 | 1.377962 | fruit | 0.74 |
0 | watermelon | Fresh1 | 0.330693 | 0.212033 | 0.333412 | fruit | 0.52 |
Exercise 1.2. Reuse your code from exercise 1.1 to extract the “Fresh” row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don’t need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don’t have this problem.
Using the same code for fruit, I simply change the path to vegetables.
vegetable_path = "data/vegetables/"
onlyfiles = [f.split('.xlsx', 1)[0] for f in listdir(vegetable_path) if isfile(join(vegetable_path, f))]
def extract(vegetable):
bkb = pd.read_excel(vegetable_path + vegetable+".xlsx", header= None, skiprows = [0,1,2])
fresh_row = bkb.iloc[0]
price_per_lb = fresh_row[1]
yield_v = fresh_row[3]
lb_per_cup = fresh_row[4]
price_per_cup = fresh_row[6]
dout = {"type" : ["vegetable"], "food" : [vegetable], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}
'''
vegetableSeries = pd.Series(data=["vegetable", vegetable, "Fresh1", price_per_lb, yield_v, lb_per_cup, price_per_cup], name = vegetable)
return vegetableSeries
'''
return dout
extract(onlyfiles[1])
{'food': ['artichoke'],
'form': ['Fresh1'],
'lb_per_cup': [0.38580895882353577],
'price_per_cup': [2.2749668026387808],
'price_per_lb': [2.2130504792860322],
'type': ['vegetable'],
'yield_v': [0.37530864197530867]}
def getVegetable(vegetableList):
df2 = pd.DataFrame()
l = []
for f in vegetableList:
if '$' not in f:
d = extract(f)
new_df = pd.DataFrame.from_dict(d)
l.append(new_df)
df2 = pd.concat(l, axis = 0)
return df2
vegetable_df = getVegetable(onlyfiles)
vegetable_df
food | form | lb_per_cup | price_per_cup | price_per_lb | type | yield_v | |
---|---|---|---|---|---|---|---|
0 | acorn_squash | Fresh1 | 0.451948 | 1.155360 | 1.17225 | vegetable | 0.458554 |
0 | artichoke | Fresh1 | 0.385809 | 2.274967 | 2.21305 | vegetable | 0.375309 |
0 | asparagus | Fresh1 | 0.396832 | 2.582272 | 3.21349 | vegetable | 0.493835 |
0 | avocados | Fresh1 | 0.31967 | 0.964886 | 2.23587 | vegetable | 0.740753 |
0 | beets | Fresh1 | 0.374786 | 0.586555 | 1.01728 | vegetable | 0.65 |
0 | blackeye_peas | Fresh1 | 0.374786 | 0.524954 | 0.910441 | vegetable | 0.65 |
0 | black_beans | Fresh1 | 0.385809 | 0.582025 | 0.98058 | vegetable | 0.65 |
0 | broccoli | Fresh1 | NaN | vegetable | |||
0 | brussels_sprouts | Fresh1 | 0.341717 | 0.890898 | 2.76355 | vegetable | 1.06 |
0 | butternut_squash | Fresh1 | 0.451948 | 0.787893 | 1.24474 | vegetable | 0.714 |
0 | cabbage | Fresh1 | 0.330693 | 0.245944 | 0.579208 | vegetable | 0.778797 |
0 | carrots | Fresh1 | NaN | NaN | NaN | vegetable | NaN |
0 | cauliflower | Fresh1 | NaN | vegetable | |||
0 | celery | Fresh1 | NaN | NaN | NaN | vegetable | NaN |
0 | collard_greens | Fresh1 | 0.286601 | 0.650001 | 2.63084 | vegetable | 1.16 |
0 | corn_sweet | Fresh1 | 0.363763 | 1.812497 | 2.69062 | vegetable | 0.54 |
0 | cucumbers | Fresh1 | 0.264555 | 0.353448 | 1.29593 | vegetable | 0.97 |
0 | great_northern_beans | Fresh1 | 0.385809 | 0.548392 | 0.923916 | vegetable | 0.65 |
0 | green_beans | Fresh1 | 0.275578 | 0.696606 | 2.13997 | vegetable | 0.846575 |
0 | green_peas | Fresh1 | 0.35274 | 0.549769 | 1.01307 | vegetable | 0.65 |
0 | green_peppers | Fresh1 | 0.264555 | 0.455022 | 1.41036 | vegetable | 0.82 |
0 | kale | Fresh1 | 0.286601 | 0.766262 | 2.8073 | vegetable | 1.05 |
0 | kidney_beans | Fresh1 | 0.385809 | 0.535194 | 0.90168 | vegetable | 0.65 |
0 | lentils | Fresh1 | 0.385809 | 0.196738 | 1.38504 | vegetable | 2.7161 |
0 | lettuce_iceberg | Fresh1 | 0.242508 | 0.309655 | 1.21304 | vegetable | 0.95 |
0 | lettuce_romaine | Fresh1 | NaN | vegetable | NaN | ||
0 | lima_beans | Fresh1 | 0.374786 | 0.797757 | 1.38357 | vegetable | 0.65 |
0 | mixed_vegetables | Fresh1 | NaN | vegetable | |||
0 | mushrooms | Fresh1 | NaN | NaN | NaN | vegetable | NaN |
0 | mustard_greens | Fresh1 | 0.308647 | 0.944032 | 2.56924 | vegetable | 0.84 |
0 | navy_beans | Fresh1 | 0.385809 | 0.575997 | 0.970423 | vegetable | 0.65 |
0 | okra | Fresh1 | 0.35274 | 1.473146 | 3.21355 | vegetable | 0.769474 |
0 | olives | Fresh1 | 0.297624 | 1.246102 | 4.18683 | vegetable | 1 |
0 | onions | Fresh1 | 0.35274 | 0.406868 | 1.03811 | vegetable | 0.9 |
0 | pinto_beans | Fresh1 | 0.385809 | 0.514129 | 0.86619 | vegetable | 0.65 |
0 | potatoes | Fresh1 | 0.264555 | 0.184017 | 0.56432 | vegetable | 0.811301 |
0 | pumpkin | Fresh1 | 0.540133 | 0.730411 | 1.35228 | vegetable | 1 |
0 | radish | Fresh1 | 0.275578 | 0.401618 | 1.31163 | vegetable | 0.9 |
0 | red_peppers | Fresh1 | 0.264555 | 0.734926 | 2.27794 | vegetable | 0.82 |
0 | spinach | Fresh1 | NaN | NaN | NaN | vegetable | NaN |
0 | summer_squash | Fresh1 | 0.396832 | 0.845480 | 1.63948 | vegetable | 0.7695 |
0 | sweet_potatoes | Fresh1 | 0.440925 | 0.499400 | 0.918897 | vegetable | 0.811301 |
0 | tomatoes | Fresh1 | NaN | NaN | NaN | vegetable | NaN |
0 | turnip_greens | Fresh1 | 0.31967 | 1.053526 | 2.47175 | vegetable | 0.75 |
The tomatoes entry does not have the correct information (it displays NaN for all its rows instead). Broccoli, celery, carrots, cauliflower, romaine lettuce, mixed vegetables, mushrooms, and spinach also have NaN values.
This is because these entries have nothing on the row which is labeled “Fresh”. There are different types of those vegetables which are sold at different prices.
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
I now combine the vegetable and fruit files into one data frame. I then clean for invalid entries.
vegetable_path = "data/vegetables/"
fruit_path = "data/fruit/"
num_files = sum(os.path.isfile(os.path.join(vegetable_path, f)) for f in os.listdir(vegetable_path))
vegetableFiles = [f.split('.xlsx', 1)[0] for f in listdir(vegetable_path) if isfile(join(vegetable_path, f))]
fruitFiles = [f.split('.xlsx', 1)[0] for f in listdir(fruit_path) if isfile(join(fruit_path, f))]
col = ["type", "food", "form", "price_per_lb", "yield_v", "lb_per_cup", "price_per_cup"]
df = pd.DataFrame()
#extract function, works for both fruits and veggies
def extract(typeO, path, obj):
bkb = pd.read_excel(path + obj+".xlsx", header= None, skiprows = [0,1,2])
fresh_row = bkb.iloc[0]
price_per_lb = fresh_row[1]
yield_v = fresh_row[3]
lb_per_cup = fresh_row[4]
price_per_cup = fresh_row[6]
dout = {"type" : [typeO], "food" : [obj], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}
return dout
#extract function, works for both fruits and veggies
def extract(typeO, path, obj):
bkb = pd.read_excel(path + obj+".xlsx", header= None, skiprows = [0,1,2])
fresh_row = bkb.iloc[0]
price_per_lb = fresh_row[1]
yield_v = fresh_row[3]
lb_per_cup = fresh_row[4]
price_per_cup = fresh_row[6]
dout = {"type" : [typeO], "food" : [obj], "form": ["Fresh1"], "price_per_lb" :[price_per_lb] , "yield_v":[yield_v], "lb_per_cup":[lb_per_cup], "price_per_cup":[price_per_cup]}
return dout
#returns combined dataframe of both fruits and veggies
def getBoth(fruitList, vegetableList):
df2 = pd.DataFrame()
l = []
for f in fruitList:
if '$' not in f:
d = extract("fruit", fruit_path, f)
new_df = pd.DataFrame.from_dict(d)
l.append(new_df)
for f in vegetableList:
if '$' not in f:
d = extract("vegetables", vegetable_path, f)
new_df = pd.DataFrame.from_dict(d)
l.append(new_df)
df2 = pd.concat(l, axis = 0)
return df2
merged_df = getBoth(fruitFiles, vegetableFiles)
merged_df[:5]
food | form | lb_per_cup | price_per_cup | price_per_lb | type | yield_v | |
---|---|---|---|---|---|---|---|
0 | apples | Fresh1 | 0.242508 | 0.422373 | 1.56752 | fruit | 0.9 |
0 | apricots | Fresh1 | 0.363763 | 1.189102 | 3.04007 | fruit | 0.93 |
0 | bananas | Fresh1 | 0.330693 | 0.292965 | 0.566983 | fruit | 0.64 |
0 | berries_mixed | Fresh1 | 0.330693 | 1.127735 | 3.41021 | fruit | 1 |
0 | blackberries | Fresh1 | 0.31967 | 1.922919 | 5.77471 | fruit | 0.96 |
I now clean the data to remove invalid rows.
def cleanData(df):
df2 = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
df2 = df2[pd.notnull(df2['price_per_lb'])]
df2 = df2[pd.notnull(df2['price_per_cup'])]
return df2
clean_df = cleanData(merged_df)
clean_df
food | form | lb_per_cup | price_per_cup | price_per_lb | type | yield_v | |
---|---|---|---|---|---|---|---|
0 | apples | Fresh1 | 0.242508 | 0.422373 | 1.56752 | fruit | 0.9 |
0 | apricots | Fresh1 | 0.363763 | 1.189102 | 3.04007 | fruit | 0.93 |
0 | bananas | Fresh1 | 0.330693 | 0.292965 | 0.566983 | fruit | 0.64 |
0 | berries_mixed | Fresh1 | 0.330693 | 1.127735 | 3.41021 | fruit | 1 |
0 | blackberries | Fresh1 | 0.31967 | 1.922919 | 5.77471 | fruit | 0.96 |
0 | blueberries | Fresh1 | 0.31967 | 1.593177 | 4.73462 | fruit | 0.95 |
0 | cantaloupe | Fresh1 | 0.374786 | 0.393800 | 0.535874 | fruit | 0.51 |
0 | cherries | Fresh1 | 0.341717 | 1.334548 | 3.59299 | fruit | 0.92 |
0 | cranberries | Fresh1 | 0.123163 | 0.589551 | 4.78674 | fruit | 1 |
0 | dates | Fresh1 | 0.165347 | 0.792234 | 4.79135 | fruit | 1 |
0 | figs | Fresh1 | 0.165347 | 0.990068 | 5.74832 | fruit | 0.96 |
0 | grapefruit | Fresh1 | 0.462971 | 0.848278 | 0.897802 | fruit | 0.49 |
0 | grapes | Fresh1 | 0.330693 | 0.721266 | 2.09383 | fruit | 0.96 |
0 | honeydew | Fresh1 | 0.374786 | 0.649077 | 0.796656 | fruit | 0.46 |
0 | kiwi | Fresh1 | 0.385809 | 1.037970 | 2.04468 | fruit | 0.76 |
0 | mangoes | Fresh1 | 0.363763 | 0.705783 | 1.37756 | fruit | 0.71 |
0 | nectarines | Fresh1 | 0.31967 | 0.618667 | 1.76115 | fruit | 0.91 |
0 | oranges | Fresh1 | 0.407855 | 0.578357 | 1.03517 | fruit | 0.73 |
0 | papaya | Fresh1 | 0.308647 | 0.646174 | 1.29801 | fruit | 0.62 |
0 | peaches | Fresh1 | 0.341717 | 0.566390 | 1.59119 | fruit | 0.96 |
0 | pears | Fresh1 | 0.363763 | 0.590740 | 1.46157 | fruit | 0.9 |
0 | pineapple | Fresh1 | 0.363763 | 0.447686 | 0.627662 | fruit | 0.51 |
0 | plums | Fresh1 | 0.363763 | 0.707176 | 1.82742 | fruit | 0.94 |
0 | pomegranate | Fresh1 | 0.341717 | 1.326342 | 2.17359 | fruit | 0.56 |
0 | raspberries | Fresh1 | 0.31967 | 2.322874 | 6.97581 | fruit | 0.96 |
0 | strawberries | Fresh1 | 0.31967 | 0.802171 | 2.35881 | fruit | 0.94 |
0 | tangerines | Fresh1 | 0.407855 | 0.759471 | 1.37796 | fruit | 0.74 |
0 | watermelon | Fresh1 | 0.330693 | 0.212033 | 0.333412 | fruit | 0.52 |
0 | acorn_squash | Fresh1 | 0.451948 | 1.155360 | 1.17225 | vegetables | 0.458554 |
0 | artichoke | Fresh1 | 0.385809 | 2.274967 | 2.21305 | vegetables | 0.375309 |
... | ... | ... | ... | ... | ... | ... | ... |
0 | blackeye_peas | Fresh1 | 0.374786 | 0.524954 | 0.910441 | vegetables | 0.65 |
0 | black_beans | Fresh1 | 0.385809 | 0.582025 | 0.98058 | vegetables | 0.65 |
0 | brussels_sprouts | Fresh1 | 0.341717 | 0.890898 | 2.76355 | vegetables | 1.06 |
0 | butternut_squash | Fresh1 | 0.451948 | 0.787893 | 1.24474 | vegetables | 0.714 |
0 | cabbage | Fresh1 | 0.330693 | 0.245944 | 0.579208 | vegetables | 0.778797 |
0 | collard_greens | Fresh1 | 0.286601 | 0.650001 | 2.63084 | vegetables | 1.16 |
0 | corn_sweet | Fresh1 | 0.363763 | 1.812497 | 2.69062 | vegetables | 0.54 |
0 | cucumbers | Fresh1 | 0.264555 | 0.353448 | 1.29593 | vegetables | 0.97 |
0 | great_northern_beans | Fresh1 | 0.385809 | 0.548392 | 0.923916 | vegetables | 0.65 |
0 | green_beans | Fresh1 | 0.275578 | 0.696606 | 2.13997 | vegetables | 0.846575 |
0 | green_peas | Fresh1 | 0.35274 | 0.549769 | 1.01307 | vegetables | 0.65 |
0 | green_peppers | Fresh1 | 0.264555 | 0.455022 | 1.41036 | vegetables | 0.82 |
0 | kale | Fresh1 | 0.286601 | 0.766262 | 2.8073 | vegetables | 1.05 |
0 | kidney_beans | Fresh1 | 0.385809 | 0.535194 | 0.90168 | vegetables | 0.65 |
0 | lentils | Fresh1 | 0.385809 | 0.196738 | 1.38504 | vegetables | 2.7161 |
0 | lettuce_iceberg | Fresh1 | 0.242508 | 0.309655 | 1.21304 | vegetables | 0.95 |
0 | lima_beans | Fresh1 | 0.374786 | 0.797757 | 1.38357 | vegetables | 0.65 |
0 | mustard_greens | Fresh1 | 0.308647 | 0.944032 | 2.56924 | vegetables | 0.84 |
0 | navy_beans | Fresh1 | 0.385809 | 0.575997 | 0.970423 | vegetables | 0.65 |
0 | okra | Fresh1 | 0.35274 | 1.473146 | 3.21355 | vegetables | 0.769474 |
0 | olives | Fresh1 | 0.297624 | 1.246102 | 4.18683 | vegetables | 1 |
0 | onions | Fresh1 | 0.35274 | 0.406868 | 1.03811 | vegetables | 0.9 |
0 | pinto_beans | Fresh1 | 0.385809 | 0.514129 | 0.86619 | vegetables | 0.65 |
0 | potatoes | Fresh1 | 0.264555 | 0.184017 | 0.56432 | vegetables | 0.811301 |
0 | pumpkin | Fresh1 | 0.540133 | 0.730411 | 1.35228 | vegetables | 1 |
0 | radish | Fresh1 | 0.275578 | 0.401618 | 1.31163 | vegetables | 0.9 |
0 | red_peppers | Fresh1 | 0.264555 | 0.734926 | 2.27794 | vegetables | 0.82 |
0 | summer_squash | Fresh1 | 0.396832 | 0.845480 | 1.63948 | vegetables | 0.7695 |
0 | sweet_potatoes | Fresh1 | 0.440925 | 0.499400 | 0.918897 | vegetables | 0.811301 |
0 | turnip_greens | Fresh1 | 0.31967 | 1.053526 | 2.47175 | vegetables | 0.75 |
63 rows × 7 columns
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
- What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
- How do the price distributions compare for fruit and vegetables?
- Which foods are the best value for the price?
- What’s something surprising about this data set?
- Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.
- Most expensive fruits are the berries (raspberries, blackberries, cranberries, blueberries) and figs/dates. The least expensive are bananas, watermelon, cantalope, and pineapple.
fruit_df.sort_values(['price_per_lb'], ascending = 0)[:9]
food | form | lb_per_cup | price_per_cup | price_per_lb | type | yield_v | |
---|---|---|---|---|---|---|---|
0 | raspberries | Fresh1 | 0.319670 | 2.322874 | 6.975811 | fruit | 0.96 |
0 | blackberries | Fresh1 | 0.319670 | 1.922919 | 5.774708 | fruit | 0.96 |
0 | figs | Fresh1 | 0.165347 | 0.990068 | 5.748318 | fruit | 0.96 |
0 | dates | Fresh1 | 0.165347 | 0.792234 | 4.791351 | fruit | 1.00 |
0 | cranberries | Fresh1 | 0.123163 | 0.589551 | 4.786741 | fruit | 1.00 |
0 | blueberries | Fresh1 | 0.319670 | 1.593177 | 4.734622 | fruit | 0.95 |
0 | cherries | Fresh1 | 0.341717 | 1.334548 | 3.592990 | fruit | 0.92 |
0 | berries_mixed | Fresh1 | 0.330693 | 1.127735 | 3.410215 | fruit | 1.00 |
0 | apricots | Fresh1 | 0.363763 | 1.189102 | 3.040072 | fruit | 0.93 |
fruit_df.sort_values(['price_per_lb'], ascending = 1)[:9]
food | form | lb_per_cup | price_per_cup | price_per_lb | type | yield_v | |
---|---|---|---|---|---|---|---|
0 | watermelon | Fresh1 | 0.330693 | 0.212033 | 0.333412 | fruit | 0.52 |
0 | cantaloupe | Fresh1 | 0.374786 | 0.393800 | 0.535874 | fruit | 0.51 |
0 | bananas | Fresh1 | 0.330693 | 0.292965 | 0.566983 | fruit | 0.64 |
0 | pineapple | Fresh1 | 0.363763 | 0.447686 | 0.627662 | fruit | 0.51 |
0 | honeydew | Fresh1 | 0.374786 | 0.649077 | 0.796656 | fruit | 0.46 |
0 | grapefruit | Fresh1 | 0.462971 | 0.848278 | 0.897802 | fruit | 0.49 |
0 | oranges | Fresh1 | 0.407855 | 0.578357 | 1.035173 | fruit | 0.73 |
0 | papaya | Fresh1 | 0.308647 | 0.646174 | 1.298012 | fruit | 0.62 |
0 | mangoes | Fresh1 | 0.363763 | 0.705783 | 1.377563 | fruit | 0.71 |
clean_df.boxplot("type", "price_per_lb")
plt.show()