AirBnb analytics: Mexico City
Overview
Using Airbnb data, we created a model with the aim to analyse data about Airbnb listings and predict the total cost for two people staying four nights in Mexico city.
To create a successful model we had to:
- Apply data-cleaning techniques to extract data from Airbnb list and create a dataset.
- Use descriptive statistics to conduct data modification and relevant analysis. Apply multi-regression modeling to visualize and analyze the data that we have collected.
Our final model takes into consideration the:
- Neighbourhood
- Room type
- N. of bathrooms
- N. of bedrooms
The model predicts that an apartment with a private room in the neighbourhoood of Tlalpan, Mexico City, which has over 15 reviews and an average review score above 95/100 will cost, on average, 2,063 MXN$ with a 95% confidence interval ranging from 2,035 to 2,090.
Loading the data
We use as source: www.insideairbnb.com Airbnb listings data was scraped directlyfrom airbnb.com
listings <- vroom::vroom("http://data.insideairbnb.com/mexico/df/mexico-city/2020-06-20/data/listings.csv.gz")
Exploratory Data Analysis (EDA)
Inspecting the data
In the original data set, we collected 21,824 values across 106 variables of which 52 are character, 15 are logical, and 38 are numerical. As we can tell, many variables are either URLs or descriptions which could not be used as predictive variables.
Thus, we first applied glimpse and skim to look through the data set and chose what we believe could be used in the later analysis stage.
Glimpse and skim are two of my favourite functions (others being table, favstats, and summary) allow us to understand better the dataset we are working with:
- Glimpse gives us a summary of the variables and observations included.
- Skim provides us with useful information concerning the missing data, a.k.a. NAs
Here, we start selecting specific variable that may be useful for our deep analysis later in the process.
tidy_listings <- listings %>%
dplyr::select(host_is_superhost, neighbourhood_cleansed, latitude, longitude, is_location_exact, property_type, room_type,
bathrooms, bedrooms, beds, square_feet, price, cleaning_fee, extra_people, number_of_reviews, review_scores_rating,
cancellation_policy, availability_365, amenities, host_identity_verified, minimum_nights, guests_included)
Looking through the raw data set, we will drop variable square_feet because most airbnb houses (21,758) do not provide it, so it will not be a meaningful predictive variable and a large amount of NA values will influence our analysis later.
tidy_listings <- tidy_listings %>%
dplyr::select(-square_feet)
Data Wrangling
Seen from the data set, we find that variables price, cleaning_fee, and extra_people should have been numerical but are characters, so we will change them back to numbers.
Also, it is not meaningful to have too many neighbourhoods for our later analysis. Thus, we use six neighbourhoods with the largest Airbnb counts and combine other neighbourhoods into these six according to geographical closeness.
Besides, amenities are in lists right now which could not be used to predict price so we create a new variable named amenities_clean that stores amenity counts.
Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has five categories: the top four categories and Other.
Moreover, as we only care about houses that offer four-night renting, we filter data set to only include minimum nights less than or equal to four nights.
Lastly, We also create a new variable named price_4_nights to calculate the total cost for two people to stay at the Airbnb property for four nights.
### Changing characters variables to numerical ones
tidy_listings <- tidy_listings %>%
mutate(price = parse_number(price), cleaning_fee = parse_number(cleaning_fee),
extra_people = parse_number(extra_people))
### Creating neighbourhood_cleased variable
tidy_listings$neighbourhood_cleansed <- as.factor(tidy_listings$neighbourhood_cleansed)
tidy_listings <- tidy_listings %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_cleansed %in% c("Cuauhtémoc", "Venustiano Carranza", "Gustavo A. Madero") ~ "Cuauhtémoc",
neighbourhood_cleansed %in% c("Miguel Hidalgo", "Azcapotzalaco") ~ "Miguel Hidalgo",
neighbourhood_cleansed %in% c("Benito Juárez", "Iztacalco") ~ "Benito Juárez",
neighbourhood_cleansed %in% c("Coyoacán", "Iztapalapa") ~ "Coyoacán",
neighbourhood_cleansed %in% c("Álvaro Obregón", "Cuajimalpa de Morelos", "La Magdalena Contreras") ~ "Álvaro Obregón",
TRUE ~ "Tlalpan"
))
### Creating amenities_clean variable
tidy_listings <- tidy_listings %>%
mutate(amenities_clean = count.fields(textConnection(amenities), sep = ","))
### Sorting the data to show the four most common properties types
sort(table(tidy_listings$property_type),decreasing=TRUE)[1:4]
##
## Apartment House Condominium Loft
## 13545 3289 1577 1120
### Sorting the data such that the top four property types have their own category and the remaining categories fall into 'other'.
tidy_listings <- tidy_listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","House", "Condominium","Loft") ~ property_type,
TRUE ~ "Other"
))
### Filtering minimum nights lower than or equal to 4 nights
tidy_listings <- filter(tidy_listings, minimum_nights <= 4)
### Change NA values to 0 in cleaning fees
tidy_listings <- tidy_listings %>%
mutate(cleaning_fee = case_when(
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee
))
### Create a new variable price_4_nights
tidy_listings <- tidy_listings %>%
mutate(price_4_nights = case_when(
guests_included >= 2 ~ 4*price + cleaning_fee,
guests_included < 2 ~ 4*price + cleaning_fee + 4*extra_people))
Now, variables are manipulated and adjusted. We assumed ‘NA’ values appear when the host has left the additional cleaning fee box blank, therefore, we set the NAs to 0.
We have more than 5,000 NA values in column review_scores_ratings so we will first plot a scatterplot to see if there is a strong relationship between review score and price.
ggplot(tidy_listings, aes(x = review_scores_rating,
y = price)) +
geom_point()

cor(tidy_listings$review_scores_rating, tidy_listings$price,
use = "complete.obs")
## [1] 0.00346
We fond out that the correlation between review score and price is only 0.0346 which means that there is nearly no linear relationship between them. We, thus, remove this variable from our data set.
Handling Missing values
Let us drop all the rows which include NA values in our data set.
The point is that we cannot keep missing observations because they will mess up our regressions later in the analysis.
# tidy_listings %>%
#dplyr::filter(!is.na(number_of_reviews),
#!is.na(bathrooms),
#!is.na(bedrooms),
#!is.na(beds),
#!is.na(review_scores_rating),
#!is.na(cancellation_policy))
Visualisations
My favourite part comes with nice plots and graphs about different variables for a clearer visualisation of the dataset.
Price variable analysis
summary(tidy_listings$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 408 747 1509 1358 349990
ggplot(tidy_listings, aes(x=price)) +
geom_density()
From summary statistics and distribution, we find out that the median of price is 747 but the maximum is 349,990 which does not make sense and the distribution is highly right-skewed.
Thus, we will include only price under 3,000 in our data set because we believe this is a reasonable price range, and any values higher than that are outliers.
Density plot of price-4-nights
#
tidy_listings <- tidy_listings %>%
filter(price <= 3000)
#
ggplot(tidy_listings, aes(x = price_4_nights)) +
geom_density()

ggplot(tidy_listings, aes(x = price_4_nights)) +
scale_x_log10() +
geom_density()

#log_price_4_nights
tidy_listings <- tidy_listings %>%
mutate (log_price_4_nights = log(price_4_nights))
Then, we plot out the density plot of price_4_nights and find out that the distribution is highly right-skewed which is not appropriate to be used as dependent variable so we try to plot log10 density plot and, now, it seems like a normal distribution. Thus, we create a new variable named log_price_4_nights to use in later regression.
General overview
We use the function GGally::ggpairs() to grab anb overview of the distributions and correlations in our dataset.
ggpairs1 <- tidy_listings %>%
dplyr::select( price, cleaning_fee, extra_people, number_of_reviews,
guests_included, room_type, host_is_superhost, prop_type_simplified) %>%
rename("Price" = price,
"Cleaning fee" = cleaning_fee,
"Extra people" = extra_people,
"N. Reviews" = number_of_reviews,
"Guests included" = guests_included,
"Room type" = room_type,
"Superhost" = host_is_superhost) %>%
GGally::ggpairs() +
theme_minimal()
ggpairs1

Central Mexico City is for the rich
ggplot(tidy_listings, aes(x = reorder(neighbourhood_simplified, price),
y = price,
fill = neighbourhood_simplified)) +
geom_boxplot(alpha = 0.6) +
labs(title = "Central Mexico City is for the rich",
subtitle = "Boxplot of prices across selected neighbourhoods",
caption = "Source: insideairbnb.com",
x= "Neighbourhood" ,
y = "Price per night (MXN $)") +
theme_bw() +
scale_fill_brewer(palette = "Set3") +
scale_fill_discrete(name = "Neighbourhood") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
It is evident and logical that the most central areas such as Miguel Hidalgo, and Cuauhtemoc Alvaro Obregon of Mexico City are the most high-priced too.
Everyone dreams about a loft in Mexico City
prop_plot <- tidy_listings %>%
filter(price <= 2000)
ggplot(prop_plot, aes(reorder(x = prop_type_simplified, price),
y = price,
fill = prop_type_simplified)) +
geom_boxplot(alpha = 0.6) +
labs(title = "Everyone dreams about a loft in Mexico City",
subtitle = "Boxplot of prices across selected types of properties",
caption = "Source: insideairbnb.com",
x= "Property type" ,
y = "Price per night (MXN $)") +
theme_bw() +
scale_fill_discrete(name = "Property type") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
On average, lofts and condominia are the most exclusive kinds of property where a couple could stay for a few night in Mexico City.
However, there are a lot of outliers across all the categories considered that may confuse the analysis.
Couples seek the right intimacy in Mexico City
Bar_chart <- tidy_listings %>%
mutate('room_type' = factor(`room_type`, levels = c("Hotel room",
"Shared room",
"Entire home/apt",
"Private room")))
ggplot(Bar_chart, aes(x= room_type,
fill = room_type)) +
geom_bar(alpha = 0.6) +
theme_bw() +
labs(title = "Couples seek the right intimacy in Mexico City",
subtitle = "Total number of rooms by most common types" ,
caption = "Source: insideairbnb.com",
x= "Room type",
y = "Count" ) +
scale_fill_discrete(name = "Room type") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
The vast majority of habitations offered in Mexico City are either a private room or an entire home/apartment.
Superhost can be super… expensive
ggplot(tidy_listings, aes(x = host_is_superhost,
y = price,
fill = host_is_superhost)) +
geom_boxplot(alpha = 0.6) +
labs(title = "Superhost can be super... expensive",
subtitle = "Boxplot of prices for superhosts and normal hosts",
caption = "Source: insideairbnb.com",
x= "Superhost",
y = "Price per night (MXN $)") +
theme_bw() +
theme(legend.position = "none") +
scale_x_discrete(labels = c("No", "Yes"))
Superhosts afford to charge more for the “superhost” brand which inspires more professionalism and safety to prospective visitors.
Correlation test between n. of reviews and price
From the scatterplot, we can notice a low correlation between the two variables. The distribution suggests us that the mean is close to 0. Therefore, we run a correlation test to check it.
ggplot(tidy_listings, aes(x= number_of_reviews,
y= price)) +
geom_point() +
labs(x = "Number of reviews (#)",
y = "Price per night (MXN $)") +
theme_bw()

ggplot(tidy_listings, aes(x = number_of_reviews)) +
geom_density() +
labs(x = "Number of reviews (#)",
y = "Density") +
theme_bw()

cor(tidy_listings$number_of_reviews, tidy_listings$price,
use = "complete.obs")
## [1] 0.103
tidy_review <- tidy_listings %>%
filter(number_of_reviews <= 20)
ggplot(tidy_review, aes(x= number_of_reviews,
y= price)) +
geom_point() +
labs(x = "Number of reviews (#)",
y = "Price per night (MXN $)") +
theme_bw()

cor(tidy_review$number_of_reviews, tidy_review$price,
use = "complete.obs")
## [1] 0.0354
Most hosts don’t charge extra for cleaning
cleaning_hist <- tidy_listings %>% filter(cleaning_fee <= 1000)
ggplot(cleaning_hist, aes(x= cleaning_fee)) +
geom_histogram(binwidth = 100) +
theme_bw() +
labs(title = "Most hosts don't charge extra for cleaning",
subtitle = "Distribution of cleaning fee charge",
caption = "Source: insideairbnb.com",
x= "Cleaning fee (MXN $)",
y = "Count") +
scale_fill_discrete()
We have filtered out hosts that apply cleaning fees above $1,000 as they are outliers.
Sometimes not everyone is welcome
ggplot(tidy_listings, aes(x= guests_included,
y= price),
fill = guests_included) +
geom_bar(stat="identity") +
theme_bw() +
labs(title = "Sometimes not everyone is welcome",
subtitle = "Number of guests included in the abitation",
caption = "Source: insideairbnb.com",
x= "Guests included (#)",
y = "Price per night (MXN $)" )

ggplot(tidy_listings, aes(x = guests_included,
y = price)) +
geom_point() +
labs(x = "Guests included (#)",
y = "Price per night (MXN $)") +
theme_bw()

cor(tidy_listings$guests_included, tidy_listings$price,
use = "complete.obs")
## [1] 0.267
The vast majority of the properties offered are meant to host only one guest and will charge a fee for any extra person.
Superhosts provide greater amenities
ggplot(tidy_listings, aes(x = amenities_clean,
y = price)) +
geom_point(aes(colour = host_is_superhost)) +
theme_minimal() +
scale_color_manual(values = c("darkorange", "cyan4"),
na.translate = FALSE) +
labs(title = "Superhosts provide greater amenities",
subtitle = "The effect of amenities on price across selected neighbourhoods",
x = "Number of amenities (#)",
y = "Price (MXN $)",
caption = "Source: insideairbnb.com",
colour = "Superhost") +
theme(legend.position = "bottom",
legend.background = element_rect(fill = "white", color = NA),
plot.title.position = "plot",
plot.caption = element_text(hjust = 0, face= "italic"),
plot.caption.position = "plot") +
scale_colour_discrete(labels = c("No", "Yes")) +
facet_wrap(~neighbourhood_simplified)
On average superhosts provide more amenities and this is consistent across all the selected neighbourhoods.
Mapping
Another fun part: displaying the housing offer in Mexico City on a map!
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
Regression
Let us finally get now more analytical: we fit our first regression model using twelve different independent variables which resulted in a R 2 of 0.471 . However, this model had too many variables , some of which were not that significant, so we removed the one with the largest p value which is cancellation policy.
tidy_listings$cancellation_policy <- as.factor(tidy_listings$cancellation_policy)
final_1 <- lm(log_price_4_nights ~ guests_included + amenities_clean + host_is_superhost + neighbourhood_simplified + room_type + prop_type_simplified + bathrooms + bedrooms + beds + is_location_exact + cancellation_policy + host_identity_verified, data = tidy_listings)
huxreg(final_1)
| (1) | |
|---|---|
| (Intercept) | 8.012 *** |
| (0.021) | |
| guests_included | 0.016 *** |
| (0.003) | |
| amenities_clean | 0.008 *** |
| (0.000) | |
| host_is_superhostTRUE | -0.020 * |
| (0.009) | |
| neighbourhood_simplifiedBenito Juárez | -0.145 *** |
| (0.017) | |
| neighbourhood_simplifiedCoyoacán | -0.172 *** |
| (0.018) | |
| neighbourhood_simplifiedCuauhtémoc | 0.032 * |
| (0.015) | |
| neighbourhood_simplifiedMiguel Hidalgo | 0.161 *** |
| (0.017) | |
| neighbourhood_simplifiedTlalpan | -0.305 *** |
| (0.021) | |
| room_typeHotel room | -0.206 *** |
| (0.036) | |
| room_typePrivate room | -0.635 *** |
| (0.009) | |
| room_typeShared room | -0.822 *** |
| (0.029) | |
| prop_type_simplifiedCondominium | 0.022 |
| (0.015) | |
| prop_type_simplifiedHouse | 0.045 *** |
| (0.012) | |
| prop_type_simplifiedLoft | -0.034 |
| (0.017) | |
| prop_type_simplifiedOther | 0.118 *** |
| (0.014) | |
| bathrooms | 0.032 *** |
| (0.006) | |
| bedrooms | 0.050 *** |
| (0.006) | |
| beds | 0.006 |
| (0.003) | |
| is_location_exactTRUE | 0.037 *** |
| (0.010) | |
| cancellation_policymoderate | 0.010 |
| (0.009) | |
| cancellation_policystrict_14_with_grace_period | 0.061 *** |
| (0.011) | |
| cancellation_policysuper_strict_30 | 0.903 * |
| (0.357) | |
| cancellation_policysuper_strict_60 | 0.485 |
| (0.293) | |
| host_identity_verifiedTRUE | 0.040 *** |
| (0.008) | |
| N | 17887 |
| R2 | 0.423 |
| logLik | -13119.507 |
| AIC | 26291.014 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(final_1)
## GVIF Df GVIF^(1/(2*Df))
## guests_included 1.24 1 1.11
## amenities_clean 1.38 1 1.17
## host_is_superhost 1.18 1 1.09
## neighbourhood_simplified 1.19 5 1.02
## room_type 1.81 3 1.10
## prop_type_simplified 1.54 4 1.06
## bathrooms 2.45 1 1.57
## bedrooms 3.12 1 1.77
## beds 2.12 1 1.46
## is_location_exact 1.02 1 1.01
## cancellation_policy 1.17 4 1.02
## host_identity_verified 1.06 1 1.03
Seen from the statistics, R2 of the model is great with a value of 0.423 but this model has too many variables and some of which are not that significant so we will delete the one with the largest p value which is cancellation policy.
final_2 <- lm(log_price_4_nights ~ guests_included + amenities_clean + host_is_superhost + neighbourhood_simplified + room_type + prop_type_simplified + bathrooms + bedrooms + beds + is_location_exact + host_identity_verified, data = tidy_listings)
huxreg(final_2)
| (1) | |
|---|---|
| (Intercept) | 8.020 *** |
| (0.021) | |
| guests_included | 0.017 *** |
| (0.003) | |
| amenities_clean | 0.008 *** |
| (0.000) | |
| host_is_superhostTRUE | -0.016 |
| (0.009) | |
| neighbourhood_simplifiedBenito Juárez | -0.145 *** |
| (0.017) | |
| neighbourhood_simplifiedCoyoacán | -0.173 *** |
| (0.018) | |
| neighbourhood_simplifiedCuauhtémoc | 0.035 * |
| (0.015) | |
| neighbourhood_simplifiedMiguel Hidalgo | 0.163 *** |
| (0.017) | |
| neighbourhood_simplifiedTlalpan | -0.306 *** |
| (0.021) | |
| room_typeHotel room | -0.202 *** |
| (0.036) | |
| room_typePrivate room | -0.640 *** |
| (0.009) | |
| room_typeShared room | -0.825 *** |
| (0.029) | |
| prop_type_simplifiedCondominium | 0.021 |
| (0.015) | |
| prop_type_simplifiedHouse | 0.046 *** |
| (0.012) | |
| prop_type_simplifiedLoft | -0.033 |
| (0.017) | |
| prop_type_simplifiedOther | 0.115 *** |
| (0.014) | |
| bathrooms | 0.031 *** |
| (0.006) | |
| bedrooms | 0.050 *** |
| (0.006) | |
| beds | 0.006 |
| (0.003) | |
| is_location_exactTRUE | 0.037 *** |
| (0.010) | |
| host_identity_verifiedTRUE | 0.044 *** |
| (0.008) | |
| N | 17887 |
| R2 | 0.422 |
| logLik | -13141.496 |
| AIC | 26326.993 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(final_2)
## GVIF Df GVIF^(1/(2*Df))
## guests_included 1.23 1 1.11
## amenities_clean 1.35 1 1.16
## host_is_superhost 1.17 1 1.08
## neighbourhood_simplified 1.18 5 1.02
## room_type 1.77 3 1.10
## prop_type_simplified 1.54 4 1.06
## bathrooms 2.45 1 1.56
## bedrooms 3.12 1 1.77
## beds 2.12 1 1.46
## is_location_exact 1.02 1 1.01
## host_identity_verified 1.05 1 1.02
We then fit the model with other 11 variables and this model gives us a R2 of 0.422 which is pretty close to before. Thus, we are pretty confident that cancellation policy does not have a significant effect on the price_4_night. Therefore, we continue to delete variables guests_included, amenities_clean, and beds which has largest p value.
We ran the model again with 8 variables but realised we had to further eliminate some independent variables to bring the AIC down.
final_3 <- lm(log_price_4_nights ~ host_is_superhost + neighbourhood_simplified + room_type + prop_type_simplified + bathrooms + bedrooms + is_location_exact + host_identity_verified, data = tidy_listings)
huxreg(final_3)
| (1) | |
|---|---|
| (Intercept) | 8.223 *** |
| (0.019) | |
| host_is_superhostTRUE | 0.042 *** |
| (0.008) | |
| neighbourhood_simplifiedBenito Juárez | -0.156 *** |
| (0.017) | |
| neighbourhood_simplifiedCoyoacán | -0.186 *** |
| (0.018) | |
| neighbourhood_simplifiedCuauhtémoc | 0.029 |
| (0.015) | |
| neighbourhood_simplifiedMiguel Hidalgo | 0.163 *** |
| (0.017) | |
| neighbourhood_simplifiedTlalpan | -0.330 *** |
| (0.021) | |
| room_typeHotel room | -0.205 *** |
| (0.036) | |
| room_typePrivate room | -0.693 *** |
| (0.009) | |
| room_typeShared room | -0.907 *** |
| (0.029) | |
| prop_type_simplifiedCondominium | 0.040 ** |
| (0.015) | |
| prop_type_simplifiedHouse | 0.047 *** |
| (0.012) | |
| prop_type_simplifiedLoft | -0.033 |
| (0.017) | |
| prop_type_simplifiedOther | 0.104 *** |
| (0.014) | |
| bathrooms | 0.041 *** |
| (0.006) | |
| bedrooms | 0.057 *** |
| (0.006) | |
| is_location_exactTRUE | 0.037 *** |
| (0.010) | |
| host_identity_verifiedTRUE | 0.066 *** |
| (0.008) | |
| N | 18069 |
| R2 | 0.407 |
| logLik | -13525.054 |
| AIC | 27088.108 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(final_3)
## GVIF Df GVIF^(1/(2*Df))
## host_is_superhost 1.05 1 1.02
## neighbourhood_simplified 1.17 5 1.02
## room_type 1.53 3 1.07
## prop_type_simplified 1.51 4 1.05
## bathrooms 2.28 1 1.51
## bedrooms 2.42 1 1.56
## is_location_exact 1.02 1 1.01
## host_identity_verified 1.03 1 1.01
We once again fit the model with other 8 variables and this model gives us a R2 of 0.407 and as we can tell, there are still too many variables in the model.
final_4 <- lm(log_price_4_nights ~ neighbourhood_simplified + room_type + bathrooms + bedrooms, data = tidy_listings)
huxreg(final_4)
| (1) | |
|---|---|
| (Intercept) | 8.300 *** |
| (0.016) | |
| neighbourhood_simplifiedBenito Juárez | -0.164 *** |
| (0.017) | |
| neighbourhood_simplifiedCoyoacán | -0.180 *** |
| (0.018) | |
| neighbourhood_simplifiedCuauhtémoc | 0.030 * |
| (0.015) | |
| neighbourhood_simplifiedMiguel Hidalgo | 0.162 *** |
| (0.017) | |
| neighbourhood_simplifiedTlalpan | -0.333 *** |
| (0.021) | |
| room_typeHotel room | -0.132 *** |
| (0.034) | |
| room_typePrivate room | -0.688 *** |
| (0.008) | |
| room_typeShared room | -0.900 *** |
| (0.029) | |
| bathrooms | 0.046 *** |
| (0.006) | |
| bedrooms | 0.056 *** |
| (0.005) | |
| N | 18069 |
| R2 | 0.402 |
| logLik | -13609.586 |
| AIC | 27243.172 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(final_4)
## GVIF Df GVIF^(1/(2*Df))
## neighbourhood_simplified 1.04 5 1.00
## room_type 1.13 3 1.02
## bathrooms 2.25 1 1.50
## bedrooms 2.37 1 1.54
autoplot(final_4)

summary(final_4)
##
## Call:
## lm(formula = log_price_4_nights ~ neighbourhood_simplified +
## room_type + bathrooms + bedrooms, data = tidy_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.690 -0.334 -0.014 0.323 2.539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.30001 0.01591 521.58 < 2e-16 ***
## neighbourhood_simplifiedBenito Juárez -0.16428 0.01655 -9.93 < 2e-16 ***
## neighbourhood_simplifiedCoyoacán -0.18016 0.01832 -9.83 < 2e-16 ***
## neighbourhood_simplifiedCuauhtémoc 0.03013 0.01507 2.00 0.046 *
## neighbourhood_simplifiedMiguel Hidalgo 0.16214 0.01703 9.52 < 2e-16 ***
## neighbourhood_simplifiedTlalpan -0.33312 0.02068 -16.11 < 2e-16 ***
## room_typeHotel room -0.13207 0.03391 -3.89 9.9e-05 ***
## room_typePrivate room -0.68779 0.00820 -83.88 < 2e-16 ***
## room_typeShared room -0.90030 0.02868 -31.39 < 2e-16 ***
## bathrooms 0.04560 0.00605 7.53 5.2e-14 ***
## bedrooms 0.05559 0.00549 10.12 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.514 on 18058 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.402, Adjusted R-squared: 0.401
## F-statistic: 1.21e+03 on 10 and 18058 DF, p-value: <2e-16
For our final model, we take out is_location_exact, host_is_superhost,and host_identity-verified and the R2 is 0.402.
target_listings <- tidy_listings %>%
dplyr::filter(number_of_reviews > 15,
review_scores_rating > 95,
room_type == "Private room",
prop_type_simplified == "Apartment",
neighbourhood_simplified == "Tlalpan")
We are essentially facing a clear trade off whilst building our model and had to weigh the benefits of additional variables (higher R 2) against the amount of error in our model ( higher AIC ).
ci <- predict(final_4, newdata = target_listings, interval = "confidence")
ci = exp(ci)
summary(ci)
## fit lwr upr
## Min. :1517 Min. :1469 Min. :1567
## 1st Qu.:1604 1st Qu.:1556 1st Qu.:1654
## Median :1604 Median :1556 Median :1654
## Mean :1606 Mean :1557 Mean :1656
## 3rd Qu.:1604 3rd Qu.:1556 3rd Qu.:1654
## Max. :1679 Max. :1626 Max. :1734
For prediction and confidence interval, we get a median of 2063 as our lower bound, 2035 as our prediction, and 2090 as our higher bound.
model1 <- lm(price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = tidy_listings)
model1
##
## Call:
## lm(formula = price_4_nights ~ prop_type_simplified + number_of_reviews +
## review_scores_rating, data = tidy_listings)
##
## Coefficients:
## (Intercept) prop_type_simplifiedCondominium
## 3003.87 240.01
## prop_type_simplifiedHouse prop_type_simplifiedLoft
## -1318.25 178.82
## prop_type_simplifiedOther number_of_reviews
## -422.57 4.53
## review_scores_rating
## 10.55
model2 <- lm(price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data = tidy_listings)
model2
##
## Call:
## lm(formula = price_4_nights ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type, data = tidy_listings)
##
## Coefficients:
## (Intercept) prop_type_simplifiedCondominium
## 4367.016 222.816
## prop_type_simplifiedHouse prop_type_simplifiedLoft
## 21.968 -522.028
## prop_type_simplifiedOther number_of_reviews
## 55.893 -0.375
## review_scores_rating room_typeHotel room
## 9.637 -565.933
## room_typePrivate room room_typeShared room
## -2855.795 -3492.220
For each increase in review_scores_rating, the price of two people for 4 nights will increase by 10.55.
Furthermore, we notice the following change in prices according to the property type:
- if the property type is condominium, the price of two people for 4 nights will increase by 240.0
- if the property type is house, the price of two people for 4 nights will decrease by 1318.3
- if the property type is loft, the price of two people for 4 nights will increase by 178.8
- if the property type is other, the price of two people for 4 nights will decrease by 422.6
Outcomes and concerns
To check that our results were valid we searched Airbnb to find out the rough cost of two people staying four nights in Tlalpan (Mexico City) the results showed an average price of 2,100 MXN$ which is comparable our findings .
Our model is rather limited in its use as it focuses on Mexico City alone to make this a general model we would need Airbnb data from around the world.