NYC Property Sales
Dataset Exploratory Data Analysis¶This dataset is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period.
This dataset contains the location, address, type, sale price, and sale date of building units sold. A reference on the trickier fields:
BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5). BLOCK; LOT: The combination of borough, block, and lot forms a unique key for property in New York City. Commonly called a BBL. BUILDING CLASS AT PRESENT and BUILDING CLASS AT TIME OF SALE: The type of building at various points in time. See the glossary linked to below. For further reference on individual fields see the Glossary of Terms. For the building classification codes see the Building Classifications Glossary.
Note that because this is a financial transaction dataset, there are some points that need to be kept in mind:
Many sales occur with a nonsensically small dollar amount: $0 most commonly. These sales are actually transfers of deeds between parties: for example, parents transferring ownership to their home to a child after moving out for retirement. This dataset uses the financial definition of a building/building unit, for tax purposes. In case a single entity owns the building in question, a sale covers the value of the entire building. In case a building is owned piecemeal by its residents (a condominium), a sale refers to a single apartment (or group of apartments) owned by some individual.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# df = pd.read_csv('nyc-rolling-sales.csv')
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0
# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
endpoint_543ddc96d06d47e3ab6611213fb10e07 = 'https://s3.eu.cloud-object-storage.appdomain.cloud'
else:
endpoint_543ddc96d06d47e3ab6611213fb10e07 = 'https://s3.private.eu.cloud-object-storage.appdomain.cloud'
client_543ddc96d06d47e3ab6611213fb10e07 = ibm_boto3.client(service_name='s3',
ibm_api_key_id='4amp42sAox3ifYH8najKIqYTC91NpP27VimSDCqMfUE3',
ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
config=Config(signature_version='oauth'),
endpoint_url=endpoint_543ddc96d06d47e3ab6611213fb10e07)
body = client_543ddc96d06d47e3ab6611213fb10e07.get_object(Bucket='skincancerdetector-donotdelete-pr-6cotsbrae4bgkz',Key='nyc-rolling-sales.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
df = pd.read_csv(body)
df.head()
Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 | |
1 | 5 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 | |
2 | 6 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 | |
3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 | |
4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |
5 rows × 22 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 84548 entries, 0 to 84547 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 84548 non-null int64 1 BOROUGH 84548 non-null int64 2 NEIGHBORHOOD 84548 non-null object 3 BUILDING CLASS CATEGORY 84548 non-null object 4 TAX CLASS AT PRESENT 84548 non-null object 5 BLOCK 84548 non-null int64 6 LOT 84548 non-null int64 7 EASE-MENT 84548 non-null object 8 BUILDING CLASS AT PRESENT 84548 non-null object 9 ADDRESS 84548 non-null object 10 APARTMENT NUMBER 84548 non-null object 11 ZIP CODE 84548 non-null int64 12 RESIDENTIAL UNITS 84548 non-null int64 13 COMMERCIAL UNITS 84548 non-null int64 14 TOTAL UNITS 84548 non-null int64 15 LAND SQUARE FEET 84548 non-null object 16 GROSS SQUARE FEET 84548 non-null object 17 YEAR BUILT 84548 non-null int64 18 TAX CLASS AT TIME OF SALE 84548 non-null int64 19 BUILDING CLASS AT TIME OF SALE 84548 non-null object 20 SALE PRICE 84548 non-null object 21 SALE DATE 84548 non-null object dtypes: int64(10), object(12) memory usage: 14.2+ MB
df.columns
Index(['Unnamed: 0', 'BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY', 'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT', 'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE', 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT', 'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE', 'SALE PRICE', 'SALE DATE'], dtype='object')
#delete 'Unnamed: 0' column
df.drop('Unnamed: 0',axis=1,inplace=True)
df.head()
BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | APARTMENT NUMBER | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 | ||
1 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 | ||
2 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 | ||
3 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 | ||
4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |
5 rows × 21 columns
df['NEIGHBORHOOD'].value_counts().count()
254
df['BOROUGH'].value_counts().count()
5
> since BOROUGH is A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5)
> and we have 254 different NEIGHBORHOOD
df['BOROUGH'] = df['BOROUGH'].astype(str)
df['BOROUGH'].replace({'1':'Manhattan','2':'Bronx','3':'Brooklyn','4':'Queens','5':'Staten Island'},inplace=True)
df['SALE PRICE'] = df['SALE PRICE'].replace({' - ':'0'})
df['SALE PRICE'] = df['SALE PRICE'].astype(float)
df['SALE DATE'] = pd.to_datetime(df['SALE DATE'])
df['sale_month']= df['SALE DATE'].dt.month_name()
df['sale_year']= df['SALE DATE'].dt.year
df.to_csv('cleaned_data.csv',index=False)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 84548 entries, 0 to 84547 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BOROUGH 84548 non-null object 1 NEIGHBORHOOD 84548 non-null object 2 BUILDING CLASS CATEGORY 84548 non-null object 3 TAX CLASS AT PRESENT 84548 non-null object 4 BLOCK 84548 non-null int64 5 LOT 84548 non-null int64 6 EASE-MENT 84548 non-null object 7 BUILDING CLASS AT PRESENT 84548 non-null object 8 ADDRESS 84548 non-null object 9 APARTMENT NUMBER 84548 non-null object 10 ZIP CODE 84548 non-null int64 11 RESIDENTIAL UNITS 84548 non-null int64 12 COMMERCIAL UNITS 84548 non-null int64 13 TOTAL UNITS 84548 non-null int64 14 LAND SQUARE FEET 84548 non-null object 15 GROSS SQUARE FEET 84548 non-null object 16 YEAR BUILT 84548 non-null int64 17 TAX CLASS AT TIME OF SALE 84548 non-null int64 18 BUILDING CLASS AT TIME OF SALE 84548 non-null object 19 SALE PRICE 84548 non-null float64 20 SALE DATE 84548 non-null datetime64[ns] 21 sale_month 84548 non-null object 22 sale_year 84548 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(9), object(12) memory usage: 14.8+ MB
df.head()
BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | APARTMENT NUMBER | ... | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | sale_month | sale_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Manhattan | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 | C2 | 153 AVENUE B | ... | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000.0 | 2017-07-19 | July | 2017 | ||
1 | Manhattan | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 | C7 | 234 EAST 4TH STREET | ... | 31 | 4616 | 18690 | 1900 | 2 | C7 | 0.0 | 2016-12-14 | December | 2016 | ||
2 | Manhattan | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 | C7 | 197 EAST 3RD STREET | ... | 17 | 2212 | 7803 | 1900 | 2 | C7 | 0.0 | 2016-12-09 | December | 2016 | ||
3 | Manhattan | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 | C4 | 154 EAST 7TH STREET | ... | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272.0 | 2016-09-23 | September | 2016 | ||
4 | Manhattan | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 | C2 | 301 EAST 10TH STREET | ... | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000.0 | 2016-11-17 | November | 2016 |
5 rows × 23 columns
df['BOROUGH'].value_counts().plot.bar()
plt.title("Borough by property's sales amount");
df['SALE PRICE'].groupby(df['BOROUGH']).mean().plot.bar()
plt.title("Borough by property's average sales");
> Queens have the most sales records but Manhattan have the largest average sale price so we need more information about the reason?
New York City's five boroughs<style data-mw-deduplicate="TemplateStyles:r992953826">.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}.mw-parser-output .infobox .navbar{font-size:100%}.mw-parser-output .navbox .navbar{display:block;font-size:100%}.mw-parser-output .navbox-title .navbar{float:left;text-align:left;margin-right:0.5em}</style>
| ||||||||
---|---|---|---|---|---|---|---|---|
Jurisdiction | Population | GDP | Land area | Density | ||||
Borough | County | Estimate (2019) |
billions (2012 US$) |
square miles |
square km |
persons / mi2 |
persons / km2 | |
Bronx
|
1,418,207 | 42.695 | 42.10 | 109.04 | 33,867 | 13,006 | ||
Kings
|
2,559,903 | 91.559 | 70.82 | 183.42 | 36,147 | 13,957 | ||
New York
|
1,628,706 | 600.244 | 22.83 | 59.13 | 71,341 | 27,544 | ||
Queens
|
2,253,858 | 93.310 | 108.53 | 281.09 | 20,767 | 8,018 | ||
Richmond
|
476,143 | 14.514 | 58.37 | 151.18 | 8,157 | 3,150 | ||
8,336,817 | 842.343 | 302.64 | 783.83 | 27,547 | 10,636 | |||
19,453,561 | 1,731.910 | 47,126.40 | 122,056.82 | 412 | 159 | |||
> ##### according to this image from Boroughs_of_New_York_City Manhattan have better geographocal location and it is the most densely populated and geographically smallest of the five boroughs of New York City
> ##### Manhattan GDP is the largest between the five so the property's should be more expensive than others and thats justifies the data that queens have the largest area but not the most expensive propety's
df[(df['BOROUGH'] =='Queens') & (df['SALE PRICE'] != 0)].count()
BOROUGH 18441 NEIGHBORHOOD 18441 BUILDING CLASS CATEGORY 18441 TAX CLASS AT PRESENT 18441 BLOCK 18441 LOT 18441 EASE-MENT 18441 BUILDING CLASS AT PRESENT 18441 ADDRESS 18441 APARTMENT NUMBER 18441 ZIP CODE 18441 RESIDENTIAL UNITS 18441 COMMERCIAL UNITS 18441 TOTAL UNITS 18441 LAND SQUARE FEET 18441 GROSS SQUARE FEET 18441 YEAR BUILT 18441 TAX CLASS AT TIME OF SALE 18441 BUILDING CLASS AT TIME OF SALE 18441 SALE PRICE 18441 SALE DATE 18441 sale_month 18441 sale_year 18441 dtype: int64
df[df['SALE PRICE'] == 0]['BOROUGH'].value_counts().sort_values().plot.bar(color=['r', 'b', 'k', 'm', 'c'])
plt.title("Borough property's transaction amount");
> ##### since 0 sales means that it's a property transaction so this graph shows the amount of property transactions in each borough
sns.histplot(df[df['YEAR BUILT']!=0]['YEAR BUILT'],bins=100);
plt.ylabel("Property's")
plt.xlabel("Year That Have been built in");
df[df['YEAR BUILT']!=0]['YEAR BUILT'].value_counts()
1920 6045 1930 5043 1925 4312 1910 3585 1950 3156 ... 1832 1 1680 1 1829 1 1888 1 1855 1 Name: YEAR BUILT, Length: 157, dtype: int64
> which mean that most of the properties that sold or been transationed is built between 1910 to 1950 where the maximum properties sales is in 1920 may be they have a good building structures!!
plt.figure(figsize=(20,20))
df['SALE PRICE'].groupby(df['BUILDING CLASS CATEGORY']).mean().sort_values().plot.barh()
plt.ylabel("Sale Price")
plt.title('average sales price per propertys usage of buildings');
> seems to Luxury hotels and office buildings is represents the largest building category in sales
> and Untility properties , Warehouses/Factory/Indus are the lowest building category in sales
plt.figure(figsize=(10,6))
df['SALE PRICE'].groupby(df['NEIGHBORHOOD']).mean().sort_values()[244:254].plot.barh()
plt.title('Top 10 average sales price per NEIGHBORHOOD ')
plt.xlabel("Sale Price");
> BloomField and Midtown CBD have the largest average sales between all neighborhood
plt.figure(figsize=(15,6))
sns.lineplot(x='sale_month',y='SALE PRICE',data=df)
plt.title('Sales Trend per month from 2016 to 2017')
plt.ylabel('Sale Price')
plt.xlabel('Month')
plt.show();
> Seem to be that in may the mean and the 95% confidence interval is having a the biggest range in both the maximum value and repeated values which indicate that in may over 2016 and 2017 we have the most of property's sales or transactions
plt.figure(figsize=(10,6))
sns.lineplot(x='BOROUGH',y='SALE PRICE',data=df)
plt.title('Sales Trend per BOROUGH')
plt.ylabel('Sale Price')
plt.show();
> as we saw before Manhattan has the highest sales
> we need to work on marketing for property's in queens and staten island cause they are have the lowest sales of all
plt.figure(figsize=(20,6))
sns.barplot(x='BOROUGH', y='SALE PRICE', hue='sale_month', data=df, palette='rainbow');
plt.title('Sales Trend per BOROUGH by month')
plt.ylabel('Sale Price')
plt.show()
> ##### May and December are the largest sales month in Manhattan in both years 2016 and 2017 > ##### February is the smallest sales month in Manhattan in both years 2016 and 2017
> ##### December is the smallest sales month in Bronx in both years 2016 and 2017
plt.figure(figsize=(10,6))
sns.heatmap(df.corr());
> so there is no effictive correlation between any numerical variables except residential and commercial units and total units and that's reasonable but not usfull
> reasonable becaues residentail and commercial is summed in the total units that why we have 0.6 of residentail correlated with the total and 0.4 of commercial correlated with total which indicate that the number of residential units is bigger than commercial units
df['RESIDENTIAL UNITS'].sum(),df['COMMERCIAL UNITS'].sum(),df['TOTAL UNITS'].sum()
(171232, 16365, 190164)
plt.figure(figsize=(12,8))
plt.subplot(1, 2, 1)
sns.barplot(x="BOROUGH", y="RESIDENTIAL UNITS", data=df, estimator=sum, ci=None)
plt.ylabel('Residential Units')
plt.subplot(1, 2, 2)
sns.barplot(x="BOROUGH", y="COMMERCIAL UNITS", data=df, estimator=sum, ci=None)
plt.ylabel('Commercial Units');
> ##### Residential units scale have a huge scale difference from commercial units in number of propeties that have been sold
Year Over Year Analysis
:¶plt.figure(figsize=(20,6))
sns.lineplot(x="BOROUGH",y="SALE PRICE",hue='sale_year',data=df, palette='rainbow',ci=None)
plt.title('Sales Trend per BOROUGH by year');
plt.ylabel('Sale Price');
> ##### This shows the comparison in years of sales for each borough
> ##### obviously Manhattan and queens have better sales in 2016 than 2017, but Bronx and Staten Island is having better sales in 2017 than 2016
plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
plt.title('Sales Trend per Commercial Units by year')
sns.lineplot(x="COMMERCIAL UNITS",y="SALE PRICE",hue='sale_year',data=df, palette='rainbow')
plt.ylabel('Sale Price')
plt.xlabel('Commercial Units')
plt.subplot(1, 2, 2)
plt.title('Sales Trend per Residential Units by year')
sns.lineplot(x="RESIDENTIAL UNITS",y="SALE PRICE",hue='sale_year',data=df, palette='rainbow')
plt.ylabel('Sale Price')
plt.xlabel('Residential Units');
> ##### This explain how Residential units makes more sales in 2016 than 2017 > ##### and this tells us that most property's revenue come from Residential units and not the Commercial units
plt.figure(figsize=(12,8))
plt.title('Sales Trend per TAX CLASS by year')
sns.barplot(x="sale_year",y="SALE PRICE",hue='TAX CLASS AT PRESENT',data=df[df['TAX CLASS AT PRESENT']!= ' '], palette='rainbow');
plt.xlabel('Sale Year')
plt.ylabel('Sale Price');
> Tax Class (4) which Includes all other properties not included in class 1,2, and 3, such as offices, factories, warehouses, garage buildings, etc, have the highest average sales over years
plt.subplots(figsize=(12,8))
sns.barplot(x='sale_year', y='SALE PRICE', hue='BOROUGH', data=df, palette='rainbow', ci=None)
plt.title('Sales per Borough from 2016-2017')
plt.ylabel('Sale Price')
plt.xlabel('Sale Year');
> so we know that manhattan has the most revenue but by this chart we can tell in which order the all five borough share in the total revenue by 2016 and 2017 and it's as followsin descending order ( Manhattan , Brooklyn, Bronx, Queens, Staten Island)