Assignment 4
In this assignment, you’ll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv
contains the fresh fruits and vegetables data you extracted in assignment 3.
The USDA Food Composition Databases have a documented web API that returns data in JSON format . You need a key in order to use the API. Only 1000 requests are allowed per hour, so it would be a good idea to use caching.
Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.
These modules may be useful:
Exercise 1.1. Read the search request documentation, then write a function called ndb_search()
that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).
Note that the search url is: https://api.nal.usda.gov/ndb/search
As an example, a search for "quail eggs"
should return this list:
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
u'ndbno': u'45094707',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
u'ndbno': u'45094890',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
u'ndbno': u'45099560',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
u'ndbno': u'45169279',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
u'ndbno': u'45178254',
u'offset': 4},
{u'ds': u'SR',
u'group': u'Dairy and Egg Products',
u'name': u'Egg, quail, whole, fresh, raw',
u'ndbno': u'01140',
u'offset': 5}]
As usual, make sure you document and test your function.
from urllib2 import Request, urlopen
from urlparse import urlparse, urlunparse
import requests, requests_cache
import pandas as pd
import json
# my API key
key = "ULxnv6kWU0vTif6L3wHrB5MIkQKj0PrM3IfgfWbG"
requests_cache.install_cache('food_cache')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-1e21dcdbe5e7> in <module>()
----> 1 requests_cache.install_cache('food_cache')
NameError: name 'requests_cache' is not defined
# generate the query request to search for results
# Notes: The request will return a JSON object. Default sort = "r", aka relevance
# EXAMPLE URL: https://api.nal.usda.gov/ndb/search/?api_key=ULxnv6kWU0vTif6L3wHrB5MIkQKj0PrM3IfgfWbG&format=json&q=mushroom
maxResults = 5
def get_request(key, searchObject):
urlbase = "https://api.nal.usda.gov/ndb/search/"
params = {"q":searchObject}
params.update({"format":"json"})
params.update({"api_key":key})
params.update({"max":maxResults})
return requests.get(urlbase,params=params)
get_request(key, "mushroom")
<Response [200]>
def ndb_search(searchObject):
searchList = []
request = get_request(key, searchObject)
searchList = request.json()
return searchList[u'list'][u'item']
ndb_search("mushroom")
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'MUSHROOM GRAVY MADE WITH REAL MUSHROOMS, UPC: 011213168265',
u'ndbno': u'45077161',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BARTOLINI, EXTRA VIRGIN OLIVE OIL WITH MUSHROOM OIL, PORCINI MUSHROOM OIL, UPC: 657739001541',
u'ndbno': u'45168655',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'MUSHROOM HOUSE, DRIED OYSTER MUSHROOMS, UPC: 084348369166',
u'ndbno': u'45184863',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'MUSHROOM HOUSE, SHIITAKE DRIED MUSHROOMS, UPC: 084348374160',
u'ndbno': u'45184864',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'MUSHROOM HOUSE, DRIED MUSHROOMS, UPC: 084348396162',
u'ndbno': u'45184865',
u'offset': 4}]
Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv
file. It’s okay if you don’t get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.
Hints:
- The foods are all raw and unbranded.
- You can test search terms with the online search page.
- You can convert the output of
ndb_search()
to a data frame withpd.DataFrame()
. - The string methods for Python and Pandas are useful here. It’s okay if you use simple regular expressions in the Pandas methods, although this exercise can be solved without them.
- You can merge data frames that have a column in common with
pd.merge()
.
Note: I use both raw_fresh and fresh throughout the code. This was because I later updated the rows of fresh to sort by food name alphabetically, so it messed up the ordering. I use raw_fresh for the old index order.
raw_fresh = pd.read_csv("fresh.csv")
fresh = raw_fresh.sort_values(by = "food")
fresh.head()
form | price_per_lb | yield | lb_per_cup | price_per_cup | food | type | |
---|---|---|---|---|---|---|---|
34 | Fresh1 | 1.172248 | 0.458554 | 0.451948 | 1.155360 | acorn_squash | vegetables |
15 | Fresh1 | 1.567515 | 0.900000 | 0.242508 | 0.422373 | apples | fruit |
9 | Fresh1 | 3.040072 | 0.930000 | 0.363763 | 1.189102 | apricots | fruit |
44 | Fresh1 | 2.213050 | 0.375309 | 0.385809 | 2.274967 | artichoke | vegetables |
24 | Fresh1 | 3.213494 | 0.493835 | 0.396832 | 2.582272 | asparagus | vegetables |
After searching on the online page, I noticed the majority have the word are formatted “[fruit/vegetable], raw”, so I will adjust my search to include that in the parameter. Entries with more than one word are separted by a “, “.
In order to find the fresh fruit or vegetable you are looking for, I am modifying the search term to make it appear on top. Modifications include:
- Adding ', raw' to the end of every entry
- Splitting multi-word entries with a comma
- Dealing with special cases
I wrote a function called modify_search_term which will deal with multi-word entries by splitting them up and adding “raw” at the end of each term. check_if_found makes sure that we get the item that we want. It goes to look into the food item’s group to make sure it belongs to the “Vegetables and Vegetable Products” group or the “Fruits and Fruit Juices” group, and not some branded entry.
grab_correct_entry builds off check_if_found and goes through the search result list and returns the first entry which check_if_found is equal to True.
These functions culminate to get_ndbno, which takes an entry from the fresh.csv and outputs its corresponding nbd number.
The output is generated in the for loop, which creates a dictionary of foods and their ndb number. The for loop passes whenever get_ndb does not output a favorable result.
def modify_search_term(food):
searchTerm = food
if "_" in food: # two+ word i.e. "collard_greens"
searchTerm = ", ".join(food.split('_'))
return searchTerm + ", raw"
modify_search_term("lettuce_iceberg")
'lettuce, iceberg, raw'
Write something to deal with when the group isn’t Vegetables and Vegetable Products or Fruits and Fruit Products. Raise a flag to indicate that the raw fruit wasn’t found.
def check_if_found(foodJson):
foodGroup = foodJson[u'group']
if (foodGroup == u'Vegetables and Vegetable Products') or (foodGroup == u'Fruits and Fruit Juices'):
return True
else:
return False
def grab_correct_entry(jsonList):
for result in jsonList:
if check_if_found(result):
return result
return None
# main function to get ndb number
def get_ndbno(entry):
searchTerm = modify_search_term(entry)
searchResults = ndb_search(searchTerm)
foodEntry = grab_correct_entry(searchResults)
if foodEntry is None:
return float('NaN')
return foodEntry['ndbno']
lettuce = get_ndbno("lettuce_iceberg")
get_ndbno('watermelon')
u'09326'
# write a function to handle duplicates (cabbage and cucumbers)
# input: takes the index for the row the food is located in raw_fresh
# outputs new search term
def look_for_details(ind):
freshForm = raw_fresh.loc[ind][0]
detail = freshForm.split(" ")[1:][0]
#return detail
if detail[len(detail) - 1] == "1":
detail = detail[:(len(detail) - 1)]
return fresh.loc[ind]['food'] + ", " + detail
look_for_details(26)
'cucumbers, peeled'
This function does not grab as many terms if I were to make it more specific. However, I kept it more general in case the data gets larger, so the look_for_details function would be able to grab more instances.
This goes through all of the food in the fresh csv. It tries to get all the ndb numbers by using the function get_ndbno(). If a duplicate is found, it will use the function look_for_details() and modify the term that you will be searching for (cabbage -> red cabbage). Assumes that the first instance of duplicate pair will be original (uses cabbage) and 2nd will be a variation (red cabbage).
ndbNumbers = {}
badRows = []
for idx, food in enumerate(raw_fresh['food']):
if food in ndbNumbers.keys(): # duplicate
food = look_for_details(idx)
try:
ndbNumbers[food] = get_ndbno(food)
except:
badRows.append(idx)
pass
ndbNumbers # didn't get kiwi, and cucumbers is wrong (not the consumed with peel)
{'acorn_squash': u'11482',
'apples': u'09003',
'apricots': u'09021',
'artichoke': u'11226',
'asparagus': u'11011',
'avocados': u'09038',
'bananas': u'09040',
'blackberries': u'09042',
'blueberries': u'09050',
'brussels_sprouts': u'11098',
'butternut_squash': u'11485',
'cabbage': u'11503',
'cabbage, red': u'11112',
'cantaloupe': u'09181',
'cherries': u'09276',
'collard_greens': u'11161',
'corn_sweet': u'11900',
'cucumbers': u'11206',
'cucumbers, peeled': u'11206',
'grapefruit': u'09117',
'grapes': u'09129',
'green_beans': u'11052',
'green_peppers': u'11333',
'honeydew': u'09184',
'kale': u'11233',
'lettuce_iceberg': u'11252',
'mangoes': u'09176',
'mustard_greens': u'11270',
'nectarines': u'09191',
'okra': u'11278',
'onions': u'11282',
'oranges': u'09201',
'papaya': u'09226',
'peaches': u'09236',
'pears': u'09252',
'pineapple': u'09266',
'plums': u'09279',
'pomegranate': u'09286',
'potatoes': u'11362',
'radish': u'11429',
'raspberries': u'09302',
'red_peppers': u'11821',
'strawberries': u'09316',
'summer_squash': u'11475',
'sweet_potatoes': u'11507',
'tangerines': u'09221',
'turnip_greens': u'11568',
'watermelon': u'09326'}
ndb_df = pd.DataFrame(ndbNumbers.values(), index = ndbNumbers.keys(), columns = ['ndbNo'])
ndb_df.head()
ndbNo | |
---|---|
red_peppers | 11821 |
bananas | 09040 |
lettuce_iceberg | 11252 |
radish | 11429 |
corn_sweet | 11900 |
ndb_df = ndb_df.sort_index()
ndb_df.head()
ndbNo | |
---|---|
acorn_squash | 11482 |
apples | 09003 |
apricots | 09021 |
artichoke | 11226 |
asparagus | 11011 |
Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report()
that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.
Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports
For example, for "09279"
(raw plums) the first element of the returned list should be:
{u'group': u'Proximates',
u'measures': [{u'eqv': 165.0,
u'eunit': u'g',
u'label': u'cup, sliced',
u'qty': 1.0,
u'value': u'143.93'},
{u'eqv': 66.0,
u'eunit': u'g',
u'label': u'fruit (2-1/8" dia)',
u'qty': 1.0,
u'value': u'57.57'},
{u'eqv': 151.0,
u'eunit': u'g',
u'label': u'NLEA serving',
u'qty': 1.0,
u'value': u'131.72'}],
u'name': u'Water',
u'nutrient_id': u'255',
u'unit': u'g',
u'value': u'87.23'}
Be sure to document and test your function.
def ndb_report(ndbNumber):
url = "https://api.nal.usda.gov/ndb/V2/reports/"
querystring = {"ndbno":ndbNumber,"api_key":"ULxnv6kWU0vTif6L3wHrB5MIkQKj0PrM3IfgfWbG","format":"json","type":"b"}
response = requests.get(url, params=querystring)
return response.json()[u'foods'][0]['food']['nutrients']
ndb_report(11112) # red cabbage
Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of “nutrition” here, but more a detailed analysis is better. Use plots to support your analysis.
I based off what is “healthy” from Energy (kcal), Fiber, Sugars (negative health), and Vitamin C. I used the formula: health = energy + fiber + vitaminC - sugars/2
prices = fresh['price_per_lb'].reindex(fresh.index)
priceList = prices.tolist()
good_prices = [priceList[i] for i in xrange(len(priceList)) if i not in badRows]
# note! this does enter the wrong price for cucumbers, with peel
ndb_df['price per lb'] = good_prices
types = fresh['type'].reindex(fresh.index)
typeList = types.tolist()
good_types = [typeList[i] for i in xrange(len(typeList)) if i not in badRows]
ndb_df['type'] = good_types
ndb_df["Nutrients"] = [ndb_report(x) for x in ndb_df['ndbNo']]
nut_keys = [1, 6, 5, 14]
# energy, sugars, fiber, vitamin C
# takes in a nutrition report and a nut_keys index, and returns the nutrient value
def nutrient_info_extractor(food, idx):
return float(food[idx]['value'])
ndb_df['energy'] = [nutrient_info_extractor(fd, nut_keys[0]) for fd in ndb_df["Nutrients"]]
ndb_df['sugar'] = [nutrient_info_extractor(fd, nut_keys[1]) for fd in ndb_df["Nutrients"]]
ndb_df['fiber'] = [nutrient_info_extractor(fd, nut_keys[2]) for fd in ndb_df["Nutrients"]]
ndb_df['vitamin C'] = [nutrient_info_extractor(fd, nut_keys[3]) for fd in ndb_df["Nutrients"]]
Heath Index
Now I calculate the “health index”, which takes calories, fiber, vitamin C, and sugars into consideration. I chose these elements because these are some very common nutrients, known to the general public. People usually want fibers and high in vitamin C, and want to stay away from sugars (sugars is my penalty term). In addition, I consulted a couple of nutrition articles to see what they usually rate as a nutritional food. I wanted to keep the index simple, so I did not include many variables. <p> Note:I divided sugars by 2, so it won’t have such a dramatic impact on the health index.
ndb_df['health index'] = ndb_df['energy'] + ndb_df['fiber'] + ndb_df['vitamin C'] - ndb_df['sugar']/2.0
ndb_df.head()
ndbNo | price per lb | Nutrients | energy | sugar | fiber | vitamin C | health index | type | |
---|---|---|---|---|---|---|---|---|---|
acorn_squash | 11482 | 1.172248 | [{u'group': u'Proximates', u'name': u'Water', ... | 40.0 | 33.00 | 1.5 | 0.14 | 25.140 | vegetables |
apples | 09003 | 1.567515 | [{u'group': u'Proximates', u'name': u'Water', ... | 52.0 | 10.39 | 2.4 | 4.60 | 53.805 | fruit |
apricots | 09021 | 3.040072 | [{u'group': u'Proximates', u'name': u'Water', ... | 48.0 | 9.24 | 2.0 | 10.00 | 55.380 | fruit |
artichoke | 11226 | 2.213050 | [{u'group': u'Proximates', u'name': u'Water', ... | 73.0 | 9.60 | 1.6 | 4.00 | 73.800 | vegetables |
asparagus | 11011 | 3.213494 | [{u'group': u'Proximates', u'name': u'Water', ... | 20.0 | 1.88 | 2.1 | 5.60 | 26.760 | vegetables |
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
chart1 = sns.regplot(x='price per lb', y='health index', data=ndb_df, ci=None, fit_reg = False, scatter_kws={'s':30})
sns.plt.title('Health Index (kcal, sugars, Vitamin C, fiber) vs. Price')
for label, x, y in zip(ndb_df.index, ndb_df['price per lb'], ndb_df['health index']):
plt.annotate(
label,
xy=(x, y), ha='center', va='bottom',
bbox=dict(boxstyle='round,pad=0.1', fc='white', alpha=0.5))
plt.show()
# plotting price vs. health index
If we base value off of price and what is considered ‘nutritious’ (health index), then according to the plot above, we should look at the foods in the top left region. We can see
- avocados
- kale
- red peppers
- brussels spouts
in the top rankings. Some other interesting foods to point out are raspberries, which stand out as a very expensive food with mediocre nutritional value. The berries in general are on the right, with mediocre health indices. Green Cabbage (labled as cabbage in the graph) is rather cheap, but surprisingly not that healthy (with a negative index).
If you generally care more for price, you should look at the left side. Green peppers, pineapples, sweet potatoes are some cheap, high in nutrient foods.
chart2 = sns.regplot(x='price per lb', y='energy', data=ndb_df, ci=None, fit_reg = False, scatter_kws={'s':30})
sns.plt.title('Energy (kcal) vs. Price')
for label, x, y in zip(ndb_df.index, ndb_df['price per lb'], ndb_df['energy']):
plt.annotate(
label,
xy=(x, y), ha='center', va='bottom',
bbox=dict(boxstyle='round,pad=0.1', fc='white', alpha=0.5))
plt.show()
When we take a simpler approach and simply base nutrition off of Energy (kcal), then you can see that avocados definitely takes the lead.
chart3 = sns.FacetGrid(ndb_df, hue = "type", size = 3)
chart3.map(plt.scatter, "price per lb", "energy").add_legend()
sns.plt.title('Energy (kcal) vs. Price')
plt.show()
There is not really a differennce between fruits and vegetables for calories.
sns.boxplot("health index", "type", data = ndb_df)
sns.plt.title('Distribution of Health, Vegetables vs. Fruit')
plt.show()
Fruits have a more limited range of health, but have a higher average health index than vegetables.
Sites used:
- http://stackoverflow.com/questions/5188792/how-to-check-a-string-for-specific-characters
- http://www.pythonforbeginners.com/dictionary/python-split
- http://stackoverflow.com/questions/5438745/python-nan-and-inf-values
- http://stackoverflow.com/questions/8312829/how-to-remove-item-from-a-python-list-if-a-condition-is-true
- http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html
- http://stackoverflow.com/questions/30009948/how-to-reorder-indexed-rows-based-on-a-list-in-pandas-data-frame
- http://stackoverflow.com/questions/5147112/matplotlib-how-to-put-individual-tags-for-a-scatter-plot