Question
Introduction
This assignment asks you to carry out a complete statistical inference workflow from start to finish, using a single real-world dataset. The assignment is divided into two parts: Question 1 focuses on a continuous (quantitative) outcome variable and will require you to use at least one interaction term in your fitted model, while Question 2 focuses on a binary outcome variable. Both parts will follow the same four-phase workflow:
(i) Pre-specified plan. During this phase, you will provide the list of predictors of your choice. Before running any model, you will specify what your hypotheses are given those chosen variables.
(ii) Descriptive statistics and data manipulation. During this phase, you will look through the data and try to understand some of its nuances. This will allow you to see whether your chosen predictors are missing any values, whether they exhibit some skewness, or whether they are categorical, numerical, or other. Here, you will also apply any data manipulations that are appropriate for either the predictors you have chosen or for the outcome variable (e.g. removal of missing values, transformation of a predictor).
(iii) Primary analysis. During this phase, you will carry out the model you specified in your pre-specified plan.
(iv) Secondary analysis. During this phase, you will re-examine and stress-test the findings of your primary analysis, hoping to get a better idea of how your results fit within the broader context of possible results one might have gotten from this dataset.
It is important that you complete phase (i) before the subsequent phases in each part of the assignment; review Lecture 13 if you need a refresher on why defining our model before doing any inference is important for the validity and truthfulness of the findings.
Dataset
The dataset, provided in the hotel_bookings.csv file, contains records of roughly 120,000 individual hotel bookings across two hotels in Portugal. Bookings span arrivals to the hotels between July 2015 and August 2017 and include both those that were seen through (honored) and those that were not (canceled).
This is not a perfect dataset. There are some strings with value NULL (instead of a true missing value, NA) in some rows for which the agent or company are not there. As such, neither of these two columns
Statistical Inference Workflow
Structuring a rigorous data analysis plan before modeling
Step 1: Understand the Workflow Requirements
The assignment outlines a structured four-phase workflow for statistical inference to ensure the validity and truthfulness of your findings:
- (i) Pre-Specified Plan: Choose your predictors and write down your hypotheses before looking at the results or running models. This prevents "p-hacking" (testing many models until you find a significant result by chance).
- (ii) Descriptive Statistics and Data Manipulation: Clean the dataset (e.g., handle missing values, address skewness, convert variable types) and explore its characteristics.
- (iii) Primary Analysis: Fit the exact model specified in your Pre-Specified Plan.
- (iv) Secondary Analysis: Perform sensitivity analyses or stress tests (e.g., checking model assumptions, trying alternative specifications) to see how robust your primary findings are.
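The checks in phase (ii) can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the rows below are made up and stand in for a few columns of hotel_bookings.csv, and the log(1 + x) transformation is one common option for right-skewed counts, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for a few columns of hotel_bookings.csv.
df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [0, 3, 7, 14, 300],   # days; one very early booking creates a right tail
    "adr": [80.0, 95.5, np.nan, 110.0, 60.0],
})

# Variable types: which columns are numeric and which are text/categorical?
print(df.dtypes)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Sample skewness of a numeric predictor (positive means a long right tail).
raw_skew = df["lead_time"].skew()

# A possible transformation for right-skewed, non-negative values: log(1 + x).
df["log_lead_time"] = np.log1p(df["lead_time"])
log_skew = df["log_lead_time"].skew()

print(round(raw_skew, 2), round(log_skew, 2))
```

Any transformation or row removal chosen at this stage should be recorded, since the primary analysis is then run on the manipulated data.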
---
Step 2: Analyze the Dataset Characteristics
The dataset is hotel_bookings.csv, containing approximately 120,000 observations of hotel bookings in Portugal from July 2015 to August 2017.
Key data cleaning notes provided:
- Missing Values: The columns `agent` and `company` contain the string `"NULL"` instead of actual missing value indicators (NA).
  - Action: You should avoid using `agent` and `company` as predictors due to these missingness issues, or ensure they are properly cleaned/re-coded before analysis.
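If you do decide to re-code these columns in Python, a minimal sketch looks like the following. The three CSV rows are invented for illustration. One detail worth knowing: `pandas.read_csv` treats the string `NULL` as missing by default, so the sketch passes `keep_default_na=False` to make the placeholder strings visible before recoding them.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for hotel_bookings.csv (values are illustrative only).
sample_csv = io.StringIO(
    "hotel,adr,agent,company\n"
    "Resort Hotel,75.0,240,NULL\n"
    "City Hotel,98.5,NULL,NULL\n"
    "City Hotel,120.0,9,40\n"
)

# keep_default_na=False keeps "NULL" as a literal string so the issue can be counted.
raw = pd.read_csv(sample_csv, keep_default_na=False)
cols = ["agent", "company"]
null_string_counts = (raw[cols] == "NULL").sum()

# Recode the placeholder string as a true missing value.
cleaned = raw.copy()
cleaned[cols] = cleaned[cols].replace("NULL", pd.NA)
missing_counts = cleaned[cols].isna().sum()

print(null_string_counts.to_dict())
print(missing_counts.to_dict())
```

Reading the file with pandas defaults would silently convert these strings to missing values, which is convenient but hides how much of each column is affected.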
---
Step 3: Plan for Question 1 and Question 2
To complete the assignment successfully, you must set up two distinct analyses:
Question 1: Continuous Outcome Variable
- Outcome Variable (\(Y\)): Must be quantitative (e.g., `adr`, the Average Daily Rate, or `lead_time`, the number of days between booking and arrival).
- Model Requirement: Must include at least one interaction term (e.g., checking if the effect of customer type on price depends on whether they booked a resort hotel or a city hotel: `customer_type * hotel`).
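To see what an interaction term estimates, here is a small NumPy sketch. Everything in it is an assumption made for illustration: the data are invented and noise-free, and both predictors are reduced to hypothetical 0/1 indicators (`resort`, `group`), whereas the real columns may have more than two levels. In R or a Python formula interface, `customer_type * hotel` expands to both main effects plus their product, which is what the design matrix below builds by hand.

```python
import numpy as np

# Made-up, noise-free data with two hypothetical binary predictors:
# resort = 1 for a resort hotel, 0 for a city hotel
# group  = 1 for one customer type, 0 for the reference type
resort = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
group = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)

# Coefficients chosen for illustration: intercept, resort, group, interaction.
true_beta = np.array([100.0, -20.0, 10.0, 15.0])

# Design matrix: a column of ones, both main effects, and their product.
X = np.column_stack([np.ones_like(resort), resort, group, resort * group])
adr = X @ true_beta

# Ordinary least squares via NumPy's least-squares solver.
beta_hat, *_ = np.linalg.lstsq(X, adr, rcond=None)


def cell_mean(r, g):
    """Mean outcome for one combination of the two indicators."""
    return adr[(resort == r) & (group == g)].mean()


# The interaction coefficient equals the difference between the two group effects.
diff_in_diff = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(np.round(beta_hat, 6), diff_in_diff)
```

The last coefficient answers the question posed by the interaction: how much the customer-type effect on price changes when moving from a city hotel to a resort hotel. A pre-specified hypothesis for Question 1 can be stated about that coefficient.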
Question 2: Binary Outcome Variable
- Outcome Variable (\(Y\)): Must be binary (e.g., `is_canceled`, where \(1 =\) canceled, \(0 =\) not canceled).
- Model Requirement: Typically a logistic regression model to estimate the probability of the binary outcome occurring.
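For the pre-specified plan, it can help to write the logistic model out explicitly. The notation below is an assumed, generic one (\(k\) chosen predictors \(x_{i1}, \dots, x_{ik}\) for booking \(i\)); the assignment text itself does not fix a notation.

\[
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad p_i = P(Y_i = 1 \mid x_{i1}, \dots, x_{ik})
\]

A hypothesis about predictor \(j\) is then a statement about its coefficient:

\[
H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0
\]

Under this model, \(e^{\beta_j}\) is the multiplicative change in the odds of \(Y = 1\) for a one-unit increase in \(x_j\), holding the other predictors fixed.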
To proceed with your assignment, follow these immediate next steps:
- Load and inspect the dataset in your statistical software (R or Python) to view all available column names.
- Select your variables for both questions:
  - For Question 1: Select a continuous outcome (like `adr`) and at least two predictors to form an interaction term.
  - For Question 2: Select a binary outcome (like `is_canceled`) and your chosen predictors.
- Write down your Pre-Specified Plan (Phase i) including your formal null and alternative hypotheses for both questions before running any regression models.