AirBnb analytics: Mexico City

Overview

Using Airbnb data, we created a model with the aim to analyse data about Airbnb listings and predict the total cost for two people staying four nights in Mexico city.

To create a successful model we had to:

Apply data-cleaning techniques to extract data from Airbnb list and create a dataset.
Use descriptive statistics to conduct data modification and relevant analysis. Apply multi-regression modeling to visualize and analyze the data that we have collected.

Our final model takes into consideration the:

Neighbourhood
Room type
N. of bathrooms
N. of bedrooms

The model predicts that an apartment with a private room in the neighbourhoood of Tlalpan, Mexico City, which has over 15 reviews and an average review score above 95/100 will cost, on average, 2,063 MXN$ with a 95% confidence interval ranging from 2,035 to 2,090.

Loading the data

We use as source: www.insideairbnb.com Airbnb listings data was scraped directlyfrom airbnb.com

listings <- vroom::vroom("http://data.insideairbnb.com/mexico/df/mexico-city/2020-06-20/data/listings.csv.gz")

Exploratory Data Analysis (EDA)

Inspecting the data

In the original data set, we collected 21,824 values across 106 variables of which 52 are character, 15 are logical, and 38 are numerical. As we can tell, many variables are either URLs or descriptions which could not be used as predictive variables.

Thus, we first applied glimpse and skim to look through the data set and chose what we believe could be used in the later analysis stage.

Glimpse and skim are two of my favourite functions (others being table, favstats, and summary) allow us to understand better the dataset we are working with:

Glimpse gives us a summary of the variables and observations included.
Skim provides us with useful information concerning the missing data, a.k.a. NAs

Here, we start selecting specific variable that may be useful for our deep analysis later in the process.

tidy_listings <- listings %>% 
dplyr::select(host_is_superhost, neighbourhood_cleansed, latitude, longitude, is_location_exact, property_type, room_type,
              bathrooms, bedrooms, beds, square_feet, price, cleaning_fee, extra_people, number_of_reviews, review_scores_rating,
              cancellation_policy, availability_365, amenities, host_identity_verified, minimum_nights, guests_included)

Looking through the raw data set, we will drop variable square_feet because most airbnb houses (21,758) do not provide it, so it will not be a meaningful predictive variable and a large amount of NA values will influence our analysis later.

tidy_listings <- tidy_listings %>% 
dplyr::select(-square_feet)

Data Wrangling

Seen from the data set, we find that variables price, cleaning_fee, and extra_people should have been numerical but are characters, so we will change them back to numbers.

Also, it is not meaningful to have too many neighbourhoods for our later analysis. Thus, we use six neighbourhoods with the largest Airbnb counts and combine other neighbourhoods into these six according to geographical closeness.

Besides, amenities are in lists right now which could not be used to predict price so we create a new variable named amenities_clean that stores amenity counts.

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has five categories: the top four categories and Other.

Moreover, as we only care about houses that offer four-night renting, we filter data set to only include minimum nights less than or equal to four nights.

Lastly, We also create a new variable named price_4_nights to calculate the total cost for two people to stay at the Airbnb property for four nights.

### Changing characters variables to numerical ones
tidy_listings <- tidy_listings %>%
  mutate(price = parse_number(price), cleaning_fee = parse_number(cleaning_fee), 
       extra_people = parse_number(extra_people))

### Creating neighbourhood_cleased variable
tidy_listings$neighbourhood_cleansed <- as.factor(tidy_listings$neighbourhood_cleansed)

tidy_listings <- tidy_listings %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c("Cuauhtémoc", "Venustiano Carranza", "Gustavo A. Madero") ~ "Cuauhtémoc", 
    neighbourhood_cleansed %in% c("Miguel Hidalgo", "Azcapotzalaco") ~ "Miguel Hidalgo",
    neighbourhood_cleansed %in% c("Benito Juárez", "Iztacalco") ~ "Benito Juárez",
    neighbourhood_cleansed %in% c("Coyoacán", "Iztapalapa") ~ "Coyoacán",
    neighbourhood_cleansed %in% c("Álvaro Obregón", "Cuajimalpa de Morelos", "La Magdalena Contreras") ~ "Álvaro Obregón",
    TRUE ~ "Tlalpan"
  ))

### Creating amenities_clean variable 
tidy_listings <- tidy_listings %>% 
  mutate(amenities_clean = count.fields(textConnection(amenities), sep = ","))

### Sorting the data to show the four most common properties types
sort(table(tidy_listings$property_type),decreasing=TRUE)[1:4]

## 
##   Apartment       House Condominium        Loft 
##       13545        3289        1577        1120

### Sorting the data such that the top four property types have their own category and the remaining categories fall into 'other'.
tidy_listings <- tidy_listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment","House", "Condominium","Loft") ~ property_type, 
    TRUE ~ "Other"
  ))

### Filtering minimum nights lower than or equal to 4 nights
tidy_listings <- filter(tidy_listings, minimum_nights <= 4)

### Change NA values to 0 in cleaning fees
tidy_listings <- tidy_listings %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee
  ))

### Create a new variable price_4_nights
tidy_listings <- tidy_listings %>% 
  mutate(price_4_nights = case_when(
    guests_included >= 2 ~ 4*price + cleaning_fee,
    guests_included < 2 ~ 4*price + cleaning_fee + 4*extra_people))

Now, variables are manipulated and adjusted. We assumed ‘NA’ values appear when the host has left the additional cleaning fee box blank, therefore, we set the NAs to 0.

We have more than 5,000 NA values in column review_scores_ratings so we will first plot a scatterplot to see if there is a strong relationship between review score and price.

ggplot(tidy_listings, aes(x = review_scores_rating,
                          y = price)) + 
  geom_point()

cor(tidy_listings$review_scores_rating, tidy_listings$price,
    use = "complete.obs")

## [1] 0.00346

We fond out that the correlation between review score and price is only 0.0346 which means that there is nearly no linear relationship between them. We, thus, remove this variable from our data set.

Handling Missing values

Let us drop all the rows which include NA values in our data set.

The point is that we cannot keep missing observations because they will mess up our regressions later in the analysis.

# tidy_listings %>%
  #dplyr::filter(!is.na(number_of_reviews),
         #!is.na(bathrooms),
         #!is.na(bedrooms),
         #!is.na(beds),
         #!is.na(review_scores_rating),
         #!is.na(cancellation_policy))

Visualisations

My favourite part comes with nice plots and graphs about different variables for a clearer visualisation of the dataset.

Price variable analysis

summary(tidy_listings$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     408     747    1509    1358  349990

ggplot(tidy_listings, aes(x=price)) +
  geom_density()

From summary statistics and distribution, we find out that the median of price is 747 but the maximum is 349,990 which does not make sense and the distribution is highly right-skewed.

Thus, we will include only price under 3,000 in our data set because we believe this is a reasonable price range, and any values higher than that are outliers.

Density plot of price-4-nights

#
tidy_listings <- tidy_listings %>% 
  filter(price <= 3000)

#
ggplot(tidy_listings, aes(x = price_4_nights)) +
  geom_density()

ggplot(tidy_listings, aes(x = price_4_nights)) +
  scale_x_log10() +
  geom_density()

#log_price_4_nights
tidy_listings <- tidy_listings %>% 
  mutate (log_price_4_nights = log(price_4_nights))

Then, we plot out the density plot of price_4_nights and find out that the distribution is highly right-skewed which is not appropriate to be used as dependent variable so we try to plot log10 density plot and, now, it seems like a normal distribution. Thus, we create a new variable named log_price_4_nights to use in later regression.

General overview

We use the function GGally::ggpairs() to grab anb overview of the distributions and correlations in our dataset.

ggpairs1 <- tidy_listings %>%
 dplyr::select( price, cleaning_fee, extra_people, number_of_reviews,
                guests_included, room_type, host_is_superhost, prop_type_simplified) %>%
   rename("Price" = price,
          "Cleaning fee" = cleaning_fee,
          "Extra people" = extra_people,
          "N. Reviews" = number_of_reviews,
          "Guests included" = guests_included,
          "Room type" = room_type,
          "Superhost" = host_is_superhost) %>% 
  GGally::ggpairs() +
  theme_minimal() 

ggpairs1

Central Mexico City is for the rich

ggplot(tidy_listings, aes(x = reorder(neighbourhood_simplified, price),
                          y = price, 
                          fill = neighbourhood_simplified)) +
  geom_boxplot(alpha = 0.6) +
  labs(title = "Central Mexico City is for the rich",
       subtitle = "Boxplot of prices across selected neighbourhoods",
       caption = "Source: insideairbnb.com",
       x= "Neighbourhood" ,
       y = "Price per night (MXN $)") +
  theme_bw() +
  scale_fill_brewer(palette = "Set3") +
  scale_fill_discrete(name = "Neighbourhood") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

It is evident and logical that the most central areas such as Miguel Hidalgo, and Cuauhtemoc Alvaro Obregon of Mexico City are the most high-priced too.

Everyone dreams about a loft in Mexico City

prop_plot <- tidy_listings  %>%
  filter(price <= 2000)

ggplot(prop_plot, aes(reorder(x = prop_type_simplified, price),
                          y = price,
                          fill = prop_type_simplified)) +
  geom_boxplot(alpha = 0.6) +
  labs(title = "Everyone dreams about a loft in Mexico City",
       subtitle = "Boxplot of prices across selected types of properties",
       caption = "Source: insideairbnb.com",
       x= "Property type" ,
       y = "Price per night (MXN $)") +
  theme_bw() +
  scale_fill_discrete(name = "Property type") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

On average, lofts and condominia are the most exclusive kinds of property where a couple could stay for a few night in Mexico City. However, there are a lot of outliers across all the categories considered that may confuse the analysis.

Couples seek the right intimacy in Mexico City

Bar_chart <- tidy_listings %>%
   mutate('room_type' = factor(`room_type`, levels = c("Hotel room",
                                                       "Shared room",
                                                       "Entire home/apt",
                                                       "Private room")))

ggplot(Bar_chart, aes(x= room_type,
                      fill = room_type)) +
  geom_bar(alpha = 0.6) +
  theme_bw() +
  labs(title = "Couples seek the right intimacy in Mexico City",
       subtitle = "Total number of rooms by most common types" ,
       caption = "Source: insideairbnb.com",
       x= "Room type",
       y = "Count" ) +
  scale_fill_discrete(name = "Room type") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

The vast majority of habitations offered in Mexico City are either a private room or an entire home/apartment.

Superhost can be super… expensive

ggplot(tidy_listings, aes(x = host_is_superhost,
                          y = price,
                          fill = host_is_superhost)) +
  geom_boxplot(alpha = 0.6) +
  labs(title = "Superhost can be super... expensive",
       subtitle = "Boxplot of prices for superhosts and normal hosts",
       caption = "Source: insideairbnb.com",
       x= "Superhost",
       y = "Price per night (MXN $)") +
   theme_bw() +
  theme(legend.position = "none") +
  scale_x_discrete(labels = c("No", "Yes"))

Superhosts afford to charge more for the “superhost” brand which inspires more professionalism and safety to prospective visitors.

Correlation test between n. of reviews and price

From the scatterplot, we can notice a low correlation between the two variables. The distribution suggests us that the mean is close to 0. Therefore, we run a correlation test to check it.

ggplot(tidy_listings, aes(x= number_of_reviews,
                          y= price)) +
  geom_point() + 
  labs(x = "Number of reviews (#)",
       y = "Price per night (MXN $)") +
  theme_bw()

ggplot(tidy_listings, aes(x = number_of_reviews)) +
  geom_density() + 
  labs(x = "Number of reviews (#)",
       y = "Density") +
  theme_bw()

cor(tidy_listings$number_of_reviews, tidy_listings$price,
    use = "complete.obs")

## [1] 0.103

tidy_review <- tidy_listings %>% 
  filter(number_of_reviews <= 20)
ggplot(tidy_review, aes(x= number_of_reviews,
                        y= price)) +
  geom_point() + 
  labs(x = "Number of reviews (#)",
       y = "Price per night (MXN $)") +
  theme_bw()

cor(tidy_review$number_of_reviews, tidy_review$price,
    use = "complete.obs")

## [1] 0.0354

Most hosts don’t charge extra for cleaning

cleaning_hist <- tidy_listings %>%  filter(cleaning_fee <= 1000) 


ggplot(cleaning_hist, aes(x= cleaning_fee)) +
  geom_histogram(binwidth = 100) +
  theme_bw() +
  labs(title = "Most hosts don't charge extra for cleaning",
       subtitle = "Distribution of cleaning fee charge",
       caption = "Source: insideairbnb.com",
       x= "Cleaning fee (MXN $)",
       y = "Count") +
  scale_fill_discrete()

We have filtered out hosts that apply cleaning fees above $1,000 as they are outliers.

Sometimes not everyone is welcome

ggplot(tidy_listings, aes(x= guests_included,
                          y= price),
       fill = guests_included) +
  geom_bar(stat="identity") +
  theme_bw() +
  labs(title = "Sometimes not everyone is welcome",
       subtitle = "Number of guests included in the abitation",
       caption = "Source: insideairbnb.com",
       x= "Guests included (#)",
       y = "Price per night (MXN $)" )

ggplot(tidy_listings, aes(x = guests_included,
                          y = price)) +
  geom_point() +
  labs(x = "Guests included (#)",
       y = "Price per night (MXN $)") +
  theme_bw()

cor(tidy_listings$guests_included, tidy_listings$price,
    use = "complete.obs")

## [1] 0.267

The vast majority of the properties offered are meant to host only one guest and will charge a fee for any extra person.

Superhosts provide greater amenities

ggplot(tidy_listings, aes(x = amenities_clean,
                            y = price)) +
  geom_point(aes(colour = host_is_superhost)) +
  theme_minimal() +
  scale_color_manual(values = c("darkorange", "cyan4"),
                     na.translate = FALSE) +
  labs(title = "Superhosts provide greater amenities",
       subtitle = "The effect of amenities on price across selected neighbourhoods",
       x = "Number of amenities (#)",
       y = "Price (MXN $)",
       caption = "Source: insideairbnb.com",
       colour = "Superhost") +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot") +
  scale_colour_discrete(labels = c("No", "Yes")) +
  facet_wrap(~neighbourhood_simplified)

On average superhosts provide more amenities and this is consistent across all the selected neighbourhoods.

Mapping

Another fun part: displaying the housing offer in Mexico City on a map!

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Regression

Let us finally get now more analytical: we fit our first regression model using twelve different independent variables which resulted in a R 2 of 0.471 . However, this model had too many variables , some of which were not that significant, so we removed the one with the largest p value which is cancellation policy.

tidy_listings$cancellation_policy <- as.factor(tidy_listings$cancellation_policy)
 
final_1 <- lm(log_price_4_nights ~ guests_included + amenities_clean + host_is_superhost + neighbourhood_simplified + room_type + prop_type_simplified + bathrooms + bedrooms + beds + is_location_exact + cancellation_policy + host_identity_verified, data = tidy_listings)

huxreg(final_1)

	(1)
(Intercept)	8.012 ***
	(0.021)
guests_included	0.016 ***
	(0.003)
amenities_clean	0.008 ***
	(0.000)
host_is_superhostTRUE	-0.020 *
	(0.009)
neighbourhood_simplifiedBenito Juárez	-0.145 ***
	(0.017)
neighbourhood_simplifiedCoyoacán	-0.172 ***
	(0.018)
neighbourhood_simplifiedCuauhtémoc	0.032 *
	(0.015)
neighbourhood_simplifiedMiguel Hidalgo	0.161 ***
	(0.017)
neighbourhood_simplifiedTlalpan	-0.305 ***
	(0.021)
room_typeHotel room	-0.206 ***
	(0.036)
room_typePrivate room	-0.635 ***
	(0.009)
room_typeShared room	-0.822 ***
	(0.029)
prop_type_simplifiedCondominium	0.022
	(0.015)
prop_type_simplifiedHouse	0.045 ***
	(0.012)
prop_type_simplifiedLoft	-0.034
	(0.017)
prop_type_simplifiedOther	0.118 ***
	(0.014)
bathrooms	0.032 ***
	(0.006)
bedrooms	0.050 ***
	(0.006)
beds	0.006
	(0.003)
is_location_exactTRUE	0.037 ***
	(0.010)
cancellation_policymoderate	0.010
	(0.009)
cancellation_policystrict_14_with_grace_period	0.061 ***
	(0.011)
cancellation_policysuper_strict_30	0.903 *
	(0.357)
cancellation_policysuper_strict_60	0.485
	(0.293)
host_identity_verifiedTRUE	0.040 ***
	(0.008)
N	17887
R2	0.423
logLik	-13119.507
AIC	26291.014
* p < 0.001; p < 0.01; * p < 0.05.

car::vif(final_1)

##                          GVIF Df GVIF^(1/(2*Df))
## guests_included          1.24  1            1.11
## amenities_clean          1.38  1            1.17
## host_is_superhost        1.18  1            1.09
## neighbourhood_simplified 1.19  5            1.02
## room_type                1.81  3            1.10
## prop_type_simplified     1.54  4            1.06
## bathrooms                2.45  1            1.57
## bedrooms                 3.12  1            1.77
## beds                     2.12  1            1.46
## is_location_exact        1.02  1            1.01
## cancellation_policy      1.17  4            1.02
## host_identity_verified   1.06  1            1.03

Seen from the statistics, R2 of the model is great with a value of 0.423 but this model has too many variables and some of which are not that significant so we will delete the one with the largest p value which is cancellation policy.

final_2 <- lm(log_price_4_nights ~ guests_included + amenities_clean + host_is_superhost + neighbourhood_simplified + room_type + prop_type_simplified + bathrooms + bedrooms + beds + is_location_exact + host_identity_verified, data = tidy_listings)

huxreg(final_2)

	(1)
(Intercept)	8.020 ***
	(0.021)
guests_included	0.017 ***
	(0.003)
amenities_clean	0.008 ***
	(0.000)
host_is_superhostTRUE	-0.016
	(0.009)
neighbourhood_simplifiedBenito Juárez	-0.145 ***
	(0.017)
neighbourhood_simplifiedCoyoacán	-0.173 ***
	(0.018)
neighbourhood_simplifiedCuauhtémoc	0.035 *
	(0.015)
neighbourhood_simplifiedMiguel Hidalgo	0.163 ***
	(0.017)
neighbourhood_simplifiedTlalpan	-0.306 ***
	(0.021)
room_typeHotel room	-0.202 ***
	(0.036)
room_typePrivate room	-0.640 ***
	(0.009)
room_typeShared room	-0.825 ***
	(0.029)
prop_type_simplifiedCondominium	0.021
	(0.015)
prop_type_simplifiedHouse	0.046 ***
	(0.012)
prop_type_simplifiedLoft	-0.033
	(0.017)
prop_type_simplifiedOther	0.115 ***
	(0.014)
bathrooms	0.031 ***
	(0.006)
bedrooms	0.050 ***
	(0.006)
beds	0.006
	(0.003)
is_location_exactTRUE	0.037 ***
	(0.010)
host_identity_verifiedTRUE	0.044 ***
	(0.008)
N	17887
R2	0.422
logLik	-13141.496
AIC	26326.993
* p < 0.001; p < 0.01; * p < 0.05.

car::vif(final_2)

##                          GVIF Df GVIF^(1/(2*Df))
## guests_included          1.23  1            1.11
## amenities_clean          1.35  1            1.16
## host_is_superhost        1.17  1            1.08
## neighbourhood_simplified 1.18  5            1.02
## room_type                1.77  3            1.10
## prop_type_simplified     1.54  4            1.06
## bathrooms                2.45  1            1.56
## bedrooms                 3.12  1            1.77
## beds                     2.12  1            1.46
## is_location_exact        1.02  1            1.01
## host_identity_verified   1.05  1            1.02

We then fit the model with other 11 variables and this model gives us a R2 of 0.422 which is pretty close to before. Thus, we are pretty confident that cancellation policy does not have a significant effect on the price_4_night. Therefore, we continue to delete variables guests_included, amenities_clean, and beds which has largest p value.

We ran the model again with 8 variables but realised we had to further eliminate some independent variables to bring the AIC down.

final_3 <- lm(log_price_4_nights ~ host_is_superhost + neighbourhood_simplified + room_type + prop_type_simplified + bathrooms + bedrooms + is_location_exact + host_identity_verified, data = tidy_listings)

huxreg(final_3)

	(1)
(Intercept)	8.223 ***
	(0.019)
host_is_superhostTRUE	0.042 ***
	(0.008)
neighbourhood_simplifiedBenito Juárez	-0.156 ***
	(0.017)
neighbourhood_simplifiedCoyoacán	-0.186 ***
	(0.018)
neighbourhood_simplifiedCuauhtémoc	0.029
	(0.015)
neighbourhood_simplifiedMiguel Hidalgo	0.163 ***
	(0.017)
neighbourhood_simplifiedTlalpan	-0.330 ***
	(0.021)
room_typeHotel room	-0.205 ***
	(0.036)
room_typePrivate room	-0.693 ***
	(0.009)
room_typeShared room	-0.907 ***
	(0.029)
prop_type_simplifiedCondominium	0.040 **
	(0.015)
prop_type_simplifiedHouse	0.047 ***
	(0.012)
prop_type_simplifiedLoft	-0.033
	(0.017)
prop_type_simplifiedOther	0.104 ***
	(0.014)
bathrooms	0.041 ***
	(0.006)
bedrooms	0.057 ***
	(0.006)
is_location_exactTRUE	0.037 ***
	(0.010)
host_identity_verifiedTRUE	0.066 ***
	(0.008)
N	18069
R2	0.407
logLik	-13525.054
AIC	27088.108
* p < 0.001; p < 0.01; * p < 0.05.

car::vif(final_3)

##                          GVIF Df GVIF^(1/(2*Df))
## host_is_superhost        1.05  1            1.02
## neighbourhood_simplified 1.17  5            1.02
## room_type                1.53  3            1.07
## prop_type_simplified     1.51  4            1.05
## bathrooms                2.28  1            1.51
## bedrooms                 2.42  1            1.56
## is_location_exact        1.02  1            1.01
## host_identity_verified   1.03  1            1.01

We once again fit the model with other 8 variables and this model gives us a R2 of 0.407 and as we can tell, there are still too many variables in the model.

final_4 <- lm(log_price_4_nights ~ neighbourhood_simplified + room_type + bathrooms + bedrooms, data = tidy_listings)

huxreg(final_4)

	(1)
(Intercept)	8.300 ***
	(0.016)
neighbourhood_simplifiedBenito Juárez	-0.164 ***
	(0.017)
neighbourhood_simplifiedCoyoacán	-0.180 ***
	(0.018)
neighbourhood_simplifiedCuauhtémoc	0.030 *
	(0.015)
neighbourhood_simplifiedMiguel Hidalgo	0.162 ***
	(0.017)
neighbourhood_simplifiedTlalpan	-0.333 ***
	(0.021)
room_typeHotel room	-0.132 ***
	(0.034)
room_typePrivate room	-0.688 ***
	(0.008)
room_typeShared room	-0.900 ***
	(0.029)
bathrooms	0.046 ***
	(0.006)
bedrooms	0.056 ***
	(0.005)
N	18069
R2	0.402
logLik	-13609.586
AIC	27243.172
* p < 0.001; p < 0.01; * p < 0.05.

car::vif(final_4)

##                          GVIF Df GVIF^(1/(2*Df))
## neighbourhood_simplified 1.04  5            1.00
## room_type                1.13  3            1.02
## bathrooms                2.25  1            1.50
## bedrooms                 2.37  1            1.54

autoplot(final_4)

summary(final_4)

## 
## Call:
## lm(formula = log_price_4_nights ~ neighbourhood_simplified + 
##     room_type + bathrooms + bedrooms, data = tidy_listings)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.690 -0.334 -0.014  0.323  2.539 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             8.30001    0.01591  521.58  < 2e-16 ***
## neighbourhood_simplifiedBenito Juárez  -0.16428    0.01655   -9.93  < 2e-16 ***
## neighbourhood_simplifiedCoyoacán       -0.18016    0.01832   -9.83  < 2e-16 ***
## neighbourhood_simplifiedCuauhtémoc      0.03013    0.01507    2.00    0.046 *  
## neighbourhood_simplifiedMiguel Hidalgo  0.16214    0.01703    9.52  < 2e-16 ***
## neighbourhood_simplifiedTlalpan        -0.33312    0.02068  -16.11  < 2e-16 ***
## room_typeHotel room                    -0.13207    0.03391   -3.89  9.9e-05 ***
## room_typePrivate room                  -0.68779    0.00820  -83.88  < 2e-16 ***
## room_typeShared room                   -0.90030    0.02868  -31.39  < 2e-16 ***
## bathrooms                               0.04560    0.00605    7.53  5.2e-14 ***
## bedrooms                                0.05559    0.00549   10.12  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.514 on 18058 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.402,  Adjusted R-squared:  0.401 
## F-statistic: 1.21e+03 on 10 and 18058 DF,  p-value: <2e-16

For our final model, we take out is_location_exact, host_is_superhost,and host_identity-verified and the R2 is 0.402.

target_listings <- tidy_listings %>%
dplyr::filter(number_of_reviews > 15,
         review_scores_rating > 95,
         room_type == "Private room",
         prop_type_simplified == "Apartment",
         neighbourhood_simplified == "Tlalpan")

We are essentially facing a clear trade off whilst building our model and had to weigh the benefits of additional variables (higher R 2) against the amount of error in our model ( higher AIC ).

ci <- predict(final_4, newdata = target_listings, interval = "confidence")

ci = exp(ci)

summary(ci)

##       fit            lwr            upr      
##  Min.   :1517   Min.   :1469   Min.   :1567  
##  1st Qu.:1604   1st Qu.:1556   1st Qu.:1654  
##  Median :1604   Median :1556   Median :1654  
##  Mean   :1606   Mean   :1557   Mean   :1656  
##  3rd Qu.:1604   3rd Qu.:1556   3rd Qu.:1654  
##  Max.   :1679   Max.   :1626   Max.   :1734

For prediction and confidence interval, we get a median of 2063 as our lower bound, 2035 as our prediction, and 2090 as our higher bound.

model1 <- lm(price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = tidy_listings)
model1

## 
## Call:
## lm(formula = price_4_nights ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating, data = tidy_listings)
## 
## Coefficients:
##                     (Intercept)  prop_type_simplifiedCondominium  
##                         3003.87                           240.01  
##       prop_type_simplifiedHouse         prop_type_simplifiedLoft  
##                        -1318.25                           178.82  
##       prop_type_simplifiedOther                number_of_reviews  
##                         -422.57                             4.53  
##            review_scores_rating  
##                           10.55

model2 <- lm(price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data = tidy_listings)
model2

## 
## Call:
## lm(formula = price_4_nights ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type, data = tidy_listings)
## 
## Coefficients:
##                     (Intercept)  prop_type_simplifiedCondominium  
##                        4367.016                          222.816  
##       prop_type_simplifiedHouse         prop_type_simplifiedLoft  
##                          21.968                         -522.028  
##       prop_type_simplifiedOther                number_of_reviews  
##                          55.893                           -0.375  
##            review_scores_rating              room_typeHotel room  
##                           9.637                         -565.933  
##           room_typePrivate room             room_typeShared room  
##                       -2855.795                        -3492.220

For each increase in review_scores_rating, the price of two people for 4 nights will increase by 10.55.

Furthermore, we notice the following change in prices according to the property type:

if the property type is condominium, the price of two people for 4 nights will increase by 240.0
if the property type is house, the price of two people for 4 nights will decrease by 1318.3
if the property type is loft, the price of two people for 4 nights will increase by 178.8
if the property type is other, the price of two people for 4 nights will decrease by 422.6

Outcomes and concerns

To check that our results were valid we searched Airbnb to find out the rough cost of two people staying four nights in Tlalpan (Mexico City) the results showed an average price of 2,100 MXN$ which is comparable our findings .
Our model is rather limited in its use as it focuses on Mexico City alone to make this a general model we would need Airbnb data from around the world.