Construction permit data in Boston - where are investments going?

I have started looking for a property in Boston. With many of its self-claimed “up and coming” neighborhoods, it can be tricky to locate the next most profitable investment. Of course one way to make money is to chase money; where there are already lots of investment would probably see the highest appreciation. Therefore, I leveraged the Boston construction permits data published by the /data.boston.gov to hopeful identify where the investments are going

import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point
from shapely.geometry import Polygon
import os
import datetime as dt
import contextily as ctx
import rasterio as rio

os.chdir(r'C:\Users\zkuang\PycharmProjects\FIndAHouse')
house_only = True # only look at new houses
new_construnction_only = False # only look at new construction

Geopandas is a spatial analysis tool in python. I had problems installing it with pip but anaconda did the magic. Contextily is used for adding the basemap.

Permit data

permit_raw = pd.read_csv(r'tmpvnqj4wz3.csv')
permit = permit_raw.drop(['permittypedescr','comments','applicant','status','owner','state','property_id','parcel_id'],axis = 1)
permit['issued_year'] = [int(i.split('-')[0]) for i in permit.issued_date] # add a permit issued year num
permit.head()
permitnumber worktype description declared_valuation total_fees issued_date expiration_date occupancytype sq_feet address city zip lat long issued_year
0 A100071 COB City of Boston 40000.00 429.0 2011-11-04 11:04:58 2012-05-04 00:00:00 Comm 170.0 175 W Boundary RD West Roxbury 02132 42.26075 -71.14961 2011
1 A1001201 INTEXT Interior/Exterior Work 74295.75 803.0 2019-11-13 13:38:56 2020-05-13 00:00:00 Multi 0.0 211 W Springfield ST Roxbury 02118 42.34060 -71.08025 2019
2 A100137 EXTREN Renovations - Exterior 15000.00 206.0 2013-01-03 14:13:09 2013-07-03 00:00:00 1-2FAM 0.0 14 William Jackson AVE Brighton 02135 42.34460 -71.15405 2013
3 A1001913 INTREN Renovations - Interior NSC 1.00 33.0 2019-10-18 09:21:00 2020-04-18 00:00:00 Comm 0.0 130-140 Brighton AVE Allston 02134 42.35276 -71.13185 2019
4 A1002445 INTREN Renovations - Interior NSC 50000.00 556.0 2019-11-12 11:00:03 2020-05-12 00:00:00 1-2FAM 0.0 7 Amboy ST Allston 02134 42.35971 -71.12433 2019
permit.shape
(449358, 15)

Each permit comes with some key information such as it’s location (long, lat), description, occupancy type and declared value that are relevant to our quest. There are quite a lot of permits outstanding (449358, well done BOS).

We noticed a lot of projects that has 0 declared value that are not relevant in our quest. We plot a declared value distribution without those entries. We noticed a long tail even on the log scale — some of the property values are just way up there!

np.log(permit[permit.declared_valuation > 10].declared_valuation).plot(
    kind='hist',title = 'Distribution declared values on log scale', bins = 20)
<matplotlib.axes._subplots.AxesSubplot at 0x24d4d913d08>

permit.occupancytype.value_counts().head(5)
1-2FAM    140025
Comm      138222
Multi      47668
1-3FAM     46674
Mixed      17944
Name: occupancytype, dtype: int64

The most common occupancy type 1-2 family houses, followed by commercial constructions and multi family properties.

permit.description.value_counts().head(5)
Electrical                    93709
Plumbing                      59700
Gas                           44368
Renovations - Interior NSC    39010
Low Voltage                   33194
Name: description, dtype: int64
permit.groupby('description').mean()['declared_valuation'].sort_values(ascending = False).head(10)
description
New construction          8.401808e+06
Erect                     7.591059e+06
Addition                  1.613690e+06
Fast Track Application    1.082881e+06
City of Boston            8.492774e+05
Change Occupancy          6.806303e+05
Commercial Parking        5.283526e+05
Industrial Boiler         2.647291e+05
Interior/Exterior Work    2.328353e+05
Industrial Furnace        2.141444e+05
Name: declared_valuation, dtype: float64

The most common project type is electrical work. Not exactly construction here. Although expectedly, the most valualbe projects are new constructions, erect, additions

permit.issued_year.plot(kind='hist')

Finally, we found most of the permits are issued after 2010, which means the corresponding project is likely to still be relevant to today’s market (2019).

Create maps using grids

With the baseline understanding of our data, we are ready to create the value plot. First, we define our help functions to create the grids we needed for spatial merge.

def create_grids(geo_df,n = 100):
    '''Creating a grid of n by n that covers the input geo feature geo_df'''
    crs = geo_df.crs
    x_range = [min(geo_df.long),max(geo_df.long)]
    y_range = [min(geo_df.lat),max(geo_df.lat)]

    x_grids = np.arange(x_range[0],x_range[1],step = (x_range[1] - x_range[0])/n)
    y_grids = np.arange(y_range[0],y_range[1],step = (y_range[1] - y_range[0])/n)

    grids = []

    for i in range(len(x_grids)-2):
        for j in range(len(y_grids)-2):
            grids.append(Polygon(zip([x_grids[i],x_grids[i],x_grids[i+1],x_grids[i+1]],
                                     [y_grids[j],y_grids[j+1],y_grids[j+1],y_grids[j]])))

    geo_grids = GeoDataFrame([],crs = crs,geometry = grids)
    return geo_grids

def geo_join(geo_df, n):
    geo_grids = create_grids(geo_df,n = n)
    geo_join = gpd.sjoin(geo_df,geo_grids,how = 'inner',op='within')
    return geo_join

The follow cell gives us the option to further filter down the desired permit types in our study

Now we are ready to create our spatial objects. This turns our permit data into a geodataframe. Right now it’s just dots with no fine-tuned feature to display. We use EPSG:2805 (Massachusetts mainland) projected coordinate system. crs is not a necessary argument in creating GeoDataFrame, but I found it helpful in aligning our geodataframe to our basemaps.

permit_clean = permit[(~pd.isna(permit.long)) & (~pd.isna(permit.lat))
                      & (~pd.isna(permit.declared_valuation))]
geometry = [Point(xy) for xy in zip(permit_clean.long, permit_clean.lat)]
crs = {'init':'epsg:2805'}
geo_permit = GeoDataFrame(permit_clean,geometry = geometry,crs = crs)
geo_permit.plot()

geometry = [Point(xy) for xy in zip(permit_clean.long, permit_clean.lat)]
crs = {'init':'epsg:2805'} # crs be consistent
geo_permit = GeoDataFrame(permit_clean,geometry = geometry,crs = crs)
geo_grids = create_grids(geo_permit,100)

GeoPandas created simple spatial join function. When joining point features to shapefiles, specifying op = "within" can significantly reduce computational burden.

geo_join = gpd.sjoin(geo_permit, geo_grids, how='inner', op='within')

We use pandas groupby function to get a mean declared value within each grid cell. Then, We are specifying some fine tuning to get our dataset ready for plotting, such as filtering out NA mean declared values and extreme values such as ones out of [0.05, 0.95] quantile. And finally, we transfer the mean declared value to a log scale.

reported_value = geo_join.groupby('index_right').mean()['declared_valuation']
geo_grids_value = geo_grids.merge(reported_value,how = 'left',left_index = True,right_index=True)
# here we are joining by index, which are indices inherited from the geo_grids object
geo_grids_value = geo_grids_value[~pd.isna(geo_grids_value.declared_valuation)]
geo_grids_value = geo_grids_value[(geo_grids_value.declared_valuation > np.percentile(geo_grids_value.declared_valuation, 5))
                                  & (geo_grids_value.declared_valuation < np.percentile(geo_grids_value.declared_valuation, 97))]
geo_grids_value['declared_valuation_lg'] = np.log(geo_grids_value.declared_valuation+0.01)

Here we are adding a few extra features to plot

  • high value commercial construction (declared valuation > \(10^6\) USD)
  • high value public investment (investment made by City of Boston with value > \(10^6\) USD)
geometry_full = [Point(xy) for xy in zip(permit_raw.long, permit_raw.lat)]
crs = {'init':'epsg:2805'}
geo_permit_full = GeoDataFrame(permit_raw,geometry = geometry_full,crs = crs)


geo_permit_comm = geo_permit_full[(geo_permit_full.occupancytype == 'Comm')
                             & (geo_permit_full.declared_valuation > 1e6)
                             & (geo_permit_full.description.str.match("New.+"))]

public_invest = geo_permit_full[(geo_permit_full.description == 'City of Boston')
                           & (geo_permit_full.declared_valuation > 1e6)]

Finally we are ready to create the plot

def creat_base_map(geo_df,ax):
    w, s, e, n = geo_df.total_bounds
#    zoom = ctx.tile._calculate_zoom(w, s, e, n)
    image, extent = ctx.bounds2img(w, s, e, n, ll = True)
    ax.imshow(image, extent= (extent[0],extent[2],extent[1],extent[3]), alpha = 0.8,cmap=plt.get_cmap('gray'))
    return ax
_,ax = plt.subplots(1, figsize=(15,15))

ax = creat_base_map(geo_grids,ax)
geo_grids_value.plot(edgecolor='k', column='declared_valuation_lg', ax=ax, alpha=0.5, legend=True, label = 'Declared value on a log scale')
geo_permit_comm.plot(ax=ax, alpha=0.9, legend=True, color = 'r',label = 'high value commercial sites',marker = 'v')
public_invest.plot(ax=ax, alpha=0.3, legend=True, color = 'k',label = 'high value public investment',marker = 'x')

ax.set_title('Boston building permits value distribution')
ax.set_ylabel('Declared value on log(10) scale')
ax.set_xlabel('red triangle: high value commercial sites\n black cross: high value city investment')
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax.get_yticklabels(), visible=False)
ax.tick_params(axis='both', which='both', length=0)
plt.show()

Observations

  1. When all projects are considered, it is obvious that the downtown / Back Bay / Fenway area have the highest development value
  2. Seaport \ Southie have a lot of high value properties under permit as well, standing in the second tier of development value
  3. There are spread-out valuable development pockets in Dorchester and Jamaica Plain. Overall, it seems like Dorchester has a slight advantage of development value, especially for the areas along the MBTA Red line and freeway.

What if we filter for only new residential constructions?

Since I’m mostly interested in new constructions, it might be worthwhile to focus on only residential new constructions for a secondary analysis.

geo_join_filt = geo_join[(geo_join.occupancytype.str.match('(.+FAM|.unit|Mixed)')) &#matching multiple family units and mixed properties
                        (~pd.isna(geo_join.description)) &
                        (geo_join.description.str.match('(Addition|Erect|New construction)'))]# matching new constructions
reported_value_filt = geo_join_filt.groupby('index_right').mean()['declared_valuation']
geo_grids_value_filt = geo_grids.merge(reported_value_filt,how = 'left',left_index = True,right_index=True)
# here we are joining by index, which are indices inherited from the geo_grids object
geo_grids_value_filt = geo_grids_value_filt[~pd.isna(geo_grids_value_filt.declared_valuation)]
geo_grids_value_filt = geo_grids_value_filt[(geo_grids_value_filt.declared_valuation > np.percentile(geo_grids_value_filt.declared_valuation, 5))
                                  & (geo_grids_value_filt.declared_valuation < np.percentile(geo_grids_value_filt.declared_valuation, 95))]
geo_grids_value_filt['declared_valuation_lg'] = np.log(geo_grids_value_filt.declared_valuation+0.01)
_,ax = plt.subplots(1, figsize=(15,15))

ax = creat_base_map(geo_grids,ax)
geo_grids_value_filt.plot(edgecolor='k', column='declared_valuation_lg', ax=ax, alpha=0.5, legend=True, label = 'Declared value on a log scale')
geo_permit_comm.plot(ax=ax, alpha=0.9, legend=True, color = 'r',label = 'high value commercial sites',marker = 'v')
public_invest.plot(ax=ax, alpha=0.3, legend=True, color = 'k',label = 'high value public investment',marker = 'x')

ax.set_title('Boston building permits value distribution')
ax.set_ylabel('Declared value on log(10) scale')
ax.set_xlabel('red triangle: high value commercial sites\n black cross: high value city investment')
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax.get_yticklabels(), visible=False)
ax.tick_params(axis='both', which='both', length=0)
plt.show()

Observations

  1. A lot of new constructions are happening in Southie and South JP
  2. The Southie development value seems to spill out to North Dorchester along the T / freeway line

Beyond grids: KDEPLOT for density plots

The grid system is not smooth enough. I continued to experiment with KDEplot to estimate “density” of permit dots. Notice that default KDEplots coming with seaborn doesn’t take in weights, therefore we have to manually create oversampled dataset with regard to declared evaluation.

New construction density

import geoplot as gplt
permit_KDE = permit_clean[(permit_clean.declared_valuation > 100) & 
                          (permit_clean.description.str.match('(Addition|Erect|New construction)'))]
permit_KDE['w_col'] = np.floor(np.log(permit_clean.declared_valuation))
permit_KDE.w_col.plot(kind = 'hist')

Taking a look at the distribution of the integer log declared values in those new constructions.

df_point = permit_KDE[['long','lat','w_col']]
df_point.shape
(3801, 3)
geometry = [Point(xy) for xy in zip(df_point.long, df_point.lat)]
crs = {'init':'epsg:2805'}
geo_df_point = GeoDataFrame(df_point,geometry = geometry,crs = crs)
_,ax = plt.subplots(1, figsize=(15,15))
ax = creat_base_map(geo_df_point,ax)
gplt.kdeplot(geo_df_point,n_levels=25,ax=ax,shade = True,cmap='Reds',alpha = 0.6)
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax.get_yticklabels(), visible=False)
ax.tick_params(axis='both', which='both', length=0)
plt.show()

As expected, there are big pockets of build-ups in Seaport, North Dorchester, South JP, West Roxbury, and surprisingly, a big of build up in Back Bay / South End. This gives us a sense of where the developments are happening since the sheer number of constructions can drive up property price too.

New construction density weighted by declared valuation

def weighted_points(geo_df, w_col = 'w_col',print_n = False):
    weighted_points = pd.DataFrame({'lat':[],'long':[]})
    for index,i in geo_df.iterrows():
        n = int(i[w_col])
        new_df = pd.DataFrame({'lat':[i.lat]*n,'long':[i.long]*n})
        weighted_points = pd.concat([weighted_points,new_df],axis = 0)
        if (index % 500 == 0) and print_n:
            print('{} done'.format(index))
    return weighted_points
df_point_weighted = weighted_points(df_point.reindex())
geometry = [Point(xy) for xy in zip(df_point_weighted.long, permit_clean.lat)]
crs = {'init':'epsg:2805'}
geo_df_point_weighted = GeoDataFrame(df_point_weighted,geometry = geometry,crs = crs)
import geoplot as gplt


_,ax = plt.subplots(1, figsize=(15,15))

ax = creat_base_map(geo_df_point,ax)
gplt.kdeplot(geo_df_point_weighted,n_levels=25,ax=ax,shade = True,cmap='Blues',alpha = 0.6)
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax.get_yticklabels(), visible=False)
ax.tick_params(axis='both', which='both', length=0)
plt.show()

Here we see when weighted by declared valuation, downtown / Seaport area certainly overshadows the rest of the city. There high value band spans west to Fenway and south to Southie / North Dorchester.

These maps are subjected to boundary effects, hence are more likely to show high density / value in the central areas. Nevertheless, the provide another angle to investigate where developments are happening and money are flowing.