Assignment 6

Introduction to SQL

Posted by Cindy Lai on March 19, 2017

In this assignment, you’ll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

Table                   Description
crime                   Crime reports dating back to 2010.
mobile_food_locations   List of all locations where mobile food vendors sell.
mobile_food_permits     List of all mobile food vendor permits. More details here.
mobile_food_schedule    Schedules for mobile food vendors.
noise                   Noise complaints dating back to August 2015.
parking                 List of all parking lots.
parks                   List of all parks.
schools                 List of all schools.
zillow                  Zillow rent and housing statistics dating back to 1996. More details here.
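
If you want to verify this list yourself, SQLite exposes its schema through the sqlite_master catalog. A minimal sketch, assuming sf_data.sqlite sits in your working directory:

import sqlite3

# list every table in the assignment database via SQLite's catalog
conn = sqlite3.connect('sf_data.sqlite')
rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
print([name for (name,) in rows])
conn.close()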

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
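
One service-free way to do that conversion is a point-in-polygon spatial join against the ZCTA shapefile with geopandas. A minimal sketch, assuming the shapefile above, a hypothetical DataFrame df with Lat/Lon columns, and the rtree package installed (note: op= was renamed predicate= in newer geopandas releases):

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# hypothetical input: any coordinates, e.g. rows pulled from the noise table
df = pd.DataFrame({'Lat': [37.7768], 'Lon': [-122.4169]})

zctas = gpd.read_file('cb_2015_us_zcta510_500k.shp')  # postal-code polygons
points = gpd.GeoDataFrame(
    df,
    geometry=[Point(lon, lat) for lon, lat in zip(df.Lon, df.Lat)],
    crs=zctas.crs)

# each point picks up the ZCTA5CE10 (postal code) of the polygon containing it
with_zips = gpd.sjoin(points, zctas[['ZCTA5CE10', 'geometry']], how='left', op='within')
print(with_zips.ZCTA5CE10)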

Shapefiles for San Francisco Neighborhoods are available here.

Exercise 1.1. Which mobile food vendor(s) sells at the most locations?

from sqlalchemy import create_engine
import pandas as pd

sf_conn = create_engine('sqlite:///sf_data.sqlite')

sql_query = """
select Applicant, count(distinct locationid) from mobile_food_permits
JOIN mobile_food_schedule ON mobile_food_schedule.permit = mobile_food_permits.permit
group by Applicant
order by count(distinct locationid) desc
"""

pd.read_sql_query(sql_query, sf_conn).head()
Applicant count(distinct locationid)
0 May Catering 58
1 Anas Goodies Catering 37
2 Natan's Catering 37
3 Liang Bai Ping 33
4 Park's Catering 23

From the query results, we can see that May Catering sells at the most locations (58 distinct locations), followed by Anas Goodies Catering and Natan’s Catering (37 each).
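
As a side note, since we only need the top few vendors, the truncation could live in the SQL itself via LIMIT instead of .head(), a purely stylistic alternative that returns the same rows:

sql_query = """
select Applicant, count(distinct locationid) from mobile_food_permits
JOIN mobile_food_schedule ON mobile_food_schedule.permit = mobile_food_permits.permit
group by Applicant
order by count(distinct locationid) desc
limit 5
"""

pd.read_sql_query(sql_query, sf_conn)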

Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.

You should try to come up with some questions on your own, but these are examples of reasonable questions:

  • Which parts of the city are the most and least expensive?
  • Which parts of the city are the most dangerous (and at what times)?
  • Are noise complaints and mobile food vendors related?
  • What are the best times and places to find food trucks?
  • Is there a relationship between housing prices and any of the other tables?

Please make sure to clearly state each of your questions in your submission.

Which parts of the city are the most and least expensive?


import geopandas as gpd
#from mpl_toolkits.basemap import Basemap


shp_fn = 'cb_2015_us_zcta510_500k.shp'  #shapefile
cities = gpd.read_file(shp_fn)
zips = cities[cities.ZCTA5CE10.str.startswith("9")]   # narrow to 9xxxx postal codes
zips = zips[pd.to_numeric(zips.ZCTA5CE10) <= 96199]   # California ZCTAs run up to 96199
# get the prices from the different regions in the zillow db
region_median_price = pd.read_sql_query("select RegionName, avg(zillow.MedianSoldPricePerSqft_AllHomes) median_sqft from zillow group by RegionName", sf_conn)
region_median_price.tail()
RegionName median_sqft
21 94131 560.122866
22 94132 385.384982
23 94133 638.636839
24 94134 347.027455
25 94158 657.211820
new_zips = zips.reset_index()

# join the shapefile columns with the zillow df to associate price with the map region
new_df = pd.merge(new_zips, region_median_price, how = 'right', left_on = pd.to_numeric(new_zips.ZCTA5CE10), right_on = 'RegionName')
new_df.head()
index AFFGEOID10 ALAND10 AWATER10 GEOID10 ZCTA5CE10 geometry RegionName median_sqft
0 997 8600000US94108 698155 0 94108 94108 POLYGON ((-122.414826 37.794988, -122.404412 3... 94108 501.013201
1 5104 8600000US94104 200857 0 94104 94104 POLYGON ((-122.404613 37.793565, -122.401315 3... 94104 1336.128581
2 5154 8600000US94122 6124846 0 94122 94122 POLYGON ((-122.50987 37.76409, -122.508335 37.... 94122 453.210167
3 5156 8600000US94158 1703879 1342698 94158 94158 POLYGON ((-122.397866 37.772323, -122.396381 3... 94158 657.211820
4 6078 8600000US94110 6019920 12207 94110 94110 POLYGON ((-122.426722 37.736372, -122.425082 3... 94110 543.998864
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, figsize=(4,6))
plt.title("Median Price per Sq. Ft in San Francisco")
plt.ylabel("Latitude")
fig.autofmt_xdate()

new_df.plot(ax = ax, column='median_sqft', scheme='QUANTILES', legend = True)

[Figure: Median Price per Sq. Ft in San Francisco]

(Ignore the purple patch behind the legend; it is a rendering artifact.)
According to the color-coded map, the areas in brown and yellow are the most expensive parts of SF, while the dark purple areas are the least expensive.
The majority of the inexpensive areas are in the southern part of the city, while the expensive areas are concentrated in the northeast. Some of the expensive neighborhoods there could be Mission Bay, Presidio/Presidio Heights, and North Beach.
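
To back the visual reading with numbers, we can sort the per-zip medians computed above instead of eyeballing colors:

# most and least expensive zips by median sold price per sq. ft.
ranked = region_median_price.sort_values('median_sqft', ascending=False)
print(ranked.head(3))  # most expensive
print(ranked.tail(3))  # least expensive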


Which parts of the city are the most dangerous (and at what times)?


Since it is hard to come up with a methodical way to objectively quantify how severe crimes are, I will treat all crimes equally and rank districts by crime count alone.
I did exclude non-criminal incidents, but that did not change the ordering.

# get the districts sorted by the number of crime incidents over all time
query = """
select crime.PdDistrict, count(crime.IncidntNum) from crime
where crime.Category != 'NON-CRIMINAL'
group by crime.PdDistrict
order by count(crime.IncidntNum) desc
"""
pd.read_sql_query(query, sf_conn)
PdDistrict count(crime.IncidntNum)
0 SOUTHERN 170602
1 MISSION 119729
2 NORTHERN 110707
3 CENTRAL 94437
4 BAYVIEW 93121
5 INGLESIDE 80217
6 TENDERLOIN 73430
7 TARAVAL 65774
8 PARK 51486
9 RICHMOND 48195
10 None 1
# looking at recent years, 2015 onwards
query = """
select crime.PdDistrict, count(crime.IncidntNum) as '# of Crime Incidences 2015-2016' from crime 
where crime.Category != 'NON-CRIMINAL' and crime.Datetime > '2015-01-01'
group by crime.PdDistrict
order by count(crime.IncidntNum) desc
"""
pd.read_sql_query(query, sf_conn)
PdDistrict # of Crime Incidences 2015-2016
0 SOUTHERN 54189
1 NORTHERN 38172
2 MISSION 35586
3 CENTRAL 33146
4 BAYVIEW 27623
5 INGLESIDE 23434
6 TARAVAL 21279
7 TENDERLOIN 18698
8 PARK 16253
9 RICHMOND 16194
10 None 1
# see what types of crimes are most common in the top-crime regions
crime_type_counts = pd.read_sql_query("""select PdDistrict, Category, count(Category) from crime 
where crime.PdDistrict in ('SOUTHERN', 'NORTHERN', 'MISSION') and Datetime > '2015-01-01'
group by PdDistrict, crime.Category
order by count(crime.Category) desc""", sf_conn)
crime_type_counts.head(3)
PdDistrict Category count(Category)
0 SOUTHERN LARCENY/THEFT 21122
1 NORTHERN LARCENY/THEFT 15697
2 SOUTHERN NON-CRIMINAL 7239

To see the differences across the three districts (Southern, Northern, and Mission), I created a barplot of the most common crime categories, with each district shown in a separate color.

import seaborn as sns

g = sns.factorplot(x="Category", y="count(Category)", hue="PdDistrict", data=crime_type_counts.head(21),
                   size=6, kind="bar", palette="muted")

plt.xticks(rotation=30)
plt.xlabel("Crime Category")
plt.title("Crime Distribution Across Southern, Northern, Mission 2015-2016")

g.despine(left=True)
g.set_ylabels("Count")

[Figure: Crime Distribution Across Southern, Northern, Mission 2015-2016]

From the tables above, we can see which districts are the most dangerous (highest number of crime incidents).
Southern has significantly more incidents than any other district, followed by Mission and Northern.
Interestingly, when you look only at crimes from 2015 onwards, Northern overtakes Mission: over the full time range, Mission had about 8% more crimes than Northern, but in recent years Northern has about 7% more than Mission.
This could indicate that Northern got more dangerous recently and/or Mission got safer. The most common crime by far is larceny/theft. Glancing at the graph, Mission's crime mix seems more serious (e.g. vehicle theft) compared to Northern's; Northern only ranks higher overall because of its number of thefts, while Mission has higher counts in most of the other categories.
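
The percentages quoted above are easy to verify from the two tables (floats used so the division is exact under Python 2 as well):

print(119729.0 / 110707 - 1)  # ~0.08: Mission had ~8% more crimes all-time
print(38172.0 / 35586 - 1)    # ~0.07: Northern has ~7% more since 2015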

# don't look at non-criminal
datetimes = pd.read_sql_query("select crime.Datetime from crime where crime.Category != 'NON-CRIMINAL'", sf_conn)
#datetimes.head()
datetimes = list(datetimes["Datetime"])
# extract out the times crimes usually occur at

from datetime import datetime

times = [i[11:] for i in datetimes] # strip off the dates and only get the times
times = [datetime.strptime(entry, '%H:%M:%S') for entry in times] # convert to time type
plt.xticks(rotation=15)
plt.hist(times, bins = 24)
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.title("Time Distribution of Crimes in San Francisco, by Hour")
plt.show()

[Figure: Time Distribution of Crimes in San Francisco, by Hour]

Looking at times, crime frequency peaks at 7:00PM-8:00PM, followed by 12:00AM-1:00AM. Most of the concentration is from the afternoon to midnight, with a dip in the morning.
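
As an aside, the same hourly histogram can be produced more directly by letting pandas parse the timestamps; an equivalent sketch on the same data, reusing pd, plt, and sf_conn from above:

# parse Datetime once and histogram the hour component directly
dt = pd.to_datetime(pd.read_sql_query(
    "select Datetime from crime where Category != 'NON-CRIMINAL'", sf_conn)["Datetime"])
dt.dt.hour.value_counts().sort_index().plot(kind='bar')
plt.xlabel("Hour of the Day")
plt.ylabel("Frequency")
plt.show()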


Are noise complaints and mobile food vendors related?


# initial look at entries that listed "mobile_food_facility" as the type of noise
food_noise = pd.read_sql_query("select Datetime, Lat, Lon, Neighborhood from noise where Type == 'mobile_food_facility'", sf_conn)
food_noise # there's only 7 entries out of thousands!
Datetime Lat Lon Neighborhood
0 2016-10-08 15:39:50 37.768141 -122.419785 Mission
1 2016-06-25 23:45:10 37.761364 -122.434097 Castro
2 2016-08-30 21:57:31 37.780362 -122.409218 South of Market
3 2016-10-04 17:43:25 37.768141 -122.419785 Mission
4 2015-09-02 10:57:17 37.773512 -122.450718 Panhandle
5 2016-03-01 06:24:48 37.769046 -122.413270 Mission
6 2016-11-03 16:41:55 37.768402 -122.419721 Mission
# generate points for all the food truck locations (Latitude > 30 drops rows with placeholder coordinates)
truck_locs = pd.read_sql_query("select Latitude, Longitude from mobile_food_locations where Latitude > 30", sf_conn)

import shapely.geometry as geom

lonlat = [geom.Point(lon, lat) for lon, lat in zip(truck_locs.Longitude, truck_locs.Latitude)]
# note: despite the name, these points are food-truck locations (they get plotted over the noise map below)
noise_points = truck_locs.drop(["Latitude", "Longitude"], axis = 1)

noise_points = gpd.GeoDataFrame(noise_points, geometry = lonlat)
# generate zip codes based on latitude and longitude
import geocoder

noise = pd.read_sql_query("select Lon, Lat, Neighborhood, Type from noise where Type in ('mobile_food_facility', 'other_excessive_noise') and Lon is not NULL", sf_conn)

# one reverse-geocoding request per row, so this step is slow for large tables
noise["zip"] = [(geocoder.google([lat, lon], method='reverse')).postal for lon, lat in zip(noise.Lon, noise.Lat)]
# generate a count for how many noise complaints are in each zip area
noise_counts = pd.DataFrame(pd.Series(noise["zip"]).value_counts()).reset_index()
noise_counts.head()
# because of value_counts, the column names are off
index zip
0 94122 138
1 94109 82
2 94110 79
3 94112 69
4 94103 63
sf_zips = noise_counts['index'] # for later use. stores (almost) all zips in the SF region
sf_zips = pd.to_numeric(sf_zips)
# merge shapefile df with noise complaints df
noise_locs = pd.merge(new_zips, noise_counts, how = 'right', left_on = pd.to_numeric(new_zips.ZCTA5CE10), right_on = pd.to_numeric(noise_counts['index']))
# plot num of noise complaints with food truck points
base = noise_locs.plot(column='zip', scheme='QUANTILES', legend = True, cmap = 'GnBu')
plt.title("Number of Noise Complaints and Food Truck Locations")
plt.ylabel("Latitude")
plt.xlabel("Longitude")
plt.xticks(rotation=30)  # slant the longitude labels (the earlier fig handle refers to a different figure)

noise_points.plot(ax = base, color = "orangered", markersize = 4, alpha = 0.5)

[Figure: Number of Noise Complaints and Food Truck Locations]

For this problem, I looked specifically at noise complaints of type “mobile_food_facility” and “other_excessive_noise”: there were only 7 entries under mobile food, and I expected food trucks to generate surrounding noise as well (i.e. noisy people eating and talking at the food truck).
I judged the relationship by location: where the noise complaints were versus where the food trucks were. The map bins the number of noise complaints by zip code and color codes by intensity; the darker shades are ‘noisier’, and the orange/red dots are where the food trucks are located.
There may be a correlation between noise and food trucks, but it is not strong. The cluster of food trucks in the top right sits in one of the lowest noise-complaint areas, while a few of the darker (noisier) areas do have a high number of food trucks.
In addition, we have to take into account that ‘other_excessive_noise’ is very broad, and not all of it will be related to food trucks.
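
To make this comparison less eyeball-driven, one could also count food trucks per zip and correlate the two series. A rough sketch, assuming the truck zips are derived the same way as the noise zips (again one reverse-geocoding request per row, so it is slow):

# hypothetical: build per-zip truck counts analogous to noise_counts
truck_locs["zip"] = [(geocoder.google([lat, lon], method='reverse')).postal
                     for lon, lat in zip(truck_locs.Longitude, truck_locs.Latitude)]
truck_zip_counts = truck_locs["zip"].value_counts()
noise_zip_counts = noise["zip"].value_counts()

# align on zip and compute a simple Pearson correlation of the two counts
both = pd.concat([truck_zip_counts, noise_zip_counts], axis=1, keys=["trucks", "noise"]).dropna()
print(both["trucks"].corr(both["noise"]))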

What are the best times and places to find food trucks?


# note: with group by locationid and bare columns, SQLite returns an arbitrary schedule row per location
query = """select StartHour, EndHour, Latitude as Lat, Longitude as Lon from mobile_food_schedule
JOIN mobile_food_locations ON mobile_food_locations.locationid = mobile_food_schedule.locationid
where latitude > 30
group by mobile_food_schedule.locationid"""

truck_time_loc = pd.read_sql_query(query, sf_conn)

lonlat = [geom.Point(lon, lat) for lon, lat in zip(truck_time_loc.Lon, truck_time_loc.Lat)]
truck_points = truck_time_loc.drop(["Lat", "Lon"], axis = 1)

truck_points = gpd.GeoDataFrame(truck_points, geometry = lonlat)
import numpy as np

def hour_range(start, end):
    '''
    Return a numpy array with the range of hours a specific food truck is open.
    '''
    if start < end:
        return np.arange(start, end + 1)
    
    # when a food truck is open past midnight
    else:
        end = end + 24
        arr = np.arange(start, end + 1)
        arr = np.remainder(arr, 24)
        return arr
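
A quick sanity check of both branches, including the wrap-around case for trucks open past midnight:

print(hour_range(9, 17))  # [ 9 10 11 12 13 14 15 16 17]
print(hour_range(22, 2))  # [22 23  0  1  2]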
open_hours = [hour_range(StartHour, EndHour) for StartHour, EndHour in zip(truck_points.StartHour, truck_points.EndHour)]
open_hours[0:5]
[array([17, 18, 19, 20], dtype=int64),
 array([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype=int64),
 array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=int64),
 array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=int64),
 array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)]
all_hours = np.concatenate(open_hours)
plt.hist(all_hours)
plt.title("Open Hours for all Food Trucks in SF")
plt.ylabel("Count")
plt.xlabel("Hour of the Day")

[Figure: Open Hours for all Food Trucks in SF]

According to the histogram above, you can get food from food trucks at all times of the day.
However, the most popular hours are from 8AM to 3PM.
This assumes that open hours are generally the same across areas of SF.

# plot locations of all food trucks
SF_shapes = new_zips.loc[pd.to_numeric(new_zips['ZCTA5CE10']).isin(sf_zips)]
base = SF_shapes.plot(color = 'white')
truck_points.plot(ax = base, markersize = 5, alpha = 0.5, color = 'coral')
plt.title("Food Truck Locations in SF")
plt.ylabel("Latitude")
plt.xlabel("Longitude")

[Figure: Food Truck Locations in SF]

This is a map of San Francisco with all the food truck locations. We can see a high concentration on the east side, particularly near the Financial District.

Is there a relationship between housing prices and any of the other tables?



First, look at the number of private schools vs. median prices.

private_school_loc = pd.read_sql_query("select Lat, Lon from schools where Category == 'Independent / Private'", sf_conn)

lonlat = [geom.Point(lon, lat) for lon, lat in zip(private_school_loc.Lon, private_school_loc.Lat)]
private_school_loc = gpd.GeoDataFrame(private_school_loc, geometry = lonlat, crs = {'init' :'epsg:4326'})
new_df = new_df.loc[pd.to_numeric(new_df['ZCTA5CE10']).isin(sf_zips)]
base = new_df.plot(column='median_sqft', scheme='QUANTILES', legend = True, cmap = 'YlGnBu')

private_school_loc = private_school_loc.to_crs(new_df.crs)
private_school_loc.plot(ax = base, markersize = 5, alpha = 0.5, color = 'coral')

plt.title("Home Prices and Private Schools Across SF")
plt.ylabel("Latitude")
plt.xlabel("Longitude")

[Figure: Home Prices and Private Schools Across SF]

There are a few points that fall off the map. After researching a bit, they belong to the Presidio area and Treasure Island, which did not have any entries in the Zillow database, so no boundary was drawn for those areas. There is somewhat of a correlation between private schools and median price per sq ft: we can see there are more private schools in the more expensive areas (which makes sense), while the least expensive areas, in beige, appear to have the fewest.
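
To put a rough number on that impression, one could spatially join the school points to the zip polygons, count schools per zip, and correlate with price. A sketch under the same caveats as before (rtree required; op= is predicate= in newer geopandas):

# count private schools falling inside each zip polygon
schools_in_zips = gpd.sjoin(private_school_loc, new_df[['ZCTA5CE10', 'median_sqft', 'geometry']],
                            how='inner', op='within')
per_zip = schools_in_zips.groupby('ZCTA5CE10').size().rename('schools')

# correlate school counts with median price per sq. ft. (zips with no schools count as 0)
prices = new_df.set_index('ZCTA5CE10')['median_sqft']
both = pd.concat([per_zip, prices], axis=1).fillna({'schools': 0})
print(both['schools'].corr(both['median_sqft']))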

query = "select Lat, Lon from crime where Datetime > '2015-01-01' and Category != 'NON-CRIMINAL'"
crime_locs = pd.read_sql_query(query, sf_conn)
lonlat = [geom.Point(lon, lat) for lon, lat in zip(crime_locs.Lon, crime_locs.Lat)]
crime_locs = gpd.GeoDataFrame(crime_locs, geometry = lonlat)
(The per-district incident counts for 2015 onwards are the same as in the previous question's table: Southern, Northern, and Mission lead.)
base = new_df.plot(column='median_sqft', scheme='QUANTILES', legend = True, cmap = 'YlGnBu')

#crime_locs.plot(ax = base, markersize = 5, alpha = 0.5, color = 'coral')

plt.title("Home Prices Across SF")
plt.ylabel("Latitude")
plt.xlabel("Longitude")

[Figure: Home Prices Across SF]

Since crime_locs, which stores the points for all crimes from 2015 to the present, has over 280,000 rows, I was not able to plot them all. Therefore, I just compare the counts of incidents by PD district (tabulated in the previous question) against the median prices map. Looking at a map of PD districts in SF (http://sanfranciscopolice.org/police-district-maps), Southern, Northern, and Mission are mostly in the northeast. This is also where prices are medium to high.
It might be that because there are richer people in those areas, more crimes occur.
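
For completeness, there are ways around the too-many-points problem: overlay only a random sample of the crimes, or bin all of them with hexbin instead of drawing every point. A sketch of both options (the sample size is arbitrary):

# option 1: overlay a manageable random sample of crime points on the price map
base = new_df.plot(column='median_sqft', scheme='QUANTILES', legend=True, cmap='YlGnBu')
crime_locs.sample(n=5000, random_state=0).plot(ax=base, markersize=2, alpha=0.3, color='coral')

# option 2: 2-D density of all crime points via hexagonal binning
plt.figure()
plt.hexbin(crime_locs.Lon, crime_locs.Lat, gridsize=60, cmap='Reds')
plt.colorbar(label="Number of crimes")
plt.title("Crime Density in SF, 2015 Onwards")
plt.show()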