Applied Supervised Learning with R

Categorical Dependent and Numeric/Continuous Independent Variables

Hypotheses 1 and 2 have a continuous independent variable and a categorical dependent variable. Referring to the figure in the previous section, we will opt for logistic regression. In hypothesis testing, we start by defining a null hypothesis and an alternate hypothesis. We start with a negative approach, that is, we assume the null hypothesis to be what we don't want to happen (here, that no relationship exists). The hypothesis test then examines whether the observed pattern could plausibly have arisen by random chance. This is quantified as a probability, the p-value: the chance of seeing a pattern at least as strong as the one observed if the null hypothesis were true. If the p-value is less than 5% (or another suitable cut-off), we reject the null hypothesis and accept the alternate hypothesis.
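The decision rule described above can be sketched as a small helper. The function name decide_hypothesis and its alpha argument are illustrative assumptions and do not come from the book's code; the 5% default mirrors the cut-off mentioned in the text.

```r
# Hypothetical helper (not from the book): compare a p-value with a cut-off.
decide_hypothesis <- function(p_value, alpha = 0.05) {
  # Guard against values that cannot be probabilities.
  stopifnot(is.numeric(p_value), length(p_value) == 1,
            p_value >= 0, p_value <= 1)
  if (p_value < alpha) {
    "reject the null hypothesis"
  } else {
    "fail to reject the null hypothesis"
  }
}

decide_hypothesis(0.003)              # below the 5% cut-off
decide_hypothesis(0.40)               # above the 5% cut-off
decide_hypothesis(0.03, alpha = 0.01) # a stricter cut-off changes the decision
```

Note that a p-value above the cut-off only means the data do not contradict the null hypothesis strongly enough; it does not prove the null hypothesis.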

Let's begin; for hypothesis 1, we define the following:

Null hypothesis: The campaign outcome has no relationship with the employee variance rate.
Alternate hypothesis: The campaign outcome has a relationship with the employee variance rate.

We test the validity of our null hypothesis with simple logistic regression. We will discuss this topic in more detail in the following chapters. For now, we will quickly perform a simple check to test our hypothesis. The following exercise leverages R's built-in function for performing logistic regression.

Exercise 36: Hypothesis 1 Testing for Categorical Dependent Variables and Continuous Independent Variables

To perform hypothesis testing for categorical dependent variables and continuous independent variables, we will use the glm() function to fit the logistic regression model (more on this in Chapter 5, Classification). This exercise will help us statistically test whether a categorical dependent variable (for example, y) has any relationship with a continuous independent variable (for example, emp.var.rate).

Perform the following steps to complete the exercise:

Import the required libraries and create the DataFrame objects.
First, convert the dependent variable into a factor type:
df$y <- factor(df$y)
Next, perform logistic regression:
h.test <- glm(y ~ emp.var.rate, data = df, family = "binomial")
Print the test summary:
summary(h.test)
The output is as follows:
Call:
glm(formula = y ~ emp.var.rate, family = "binomial", data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0047  -0.4422  -0.3193  -0.2941   2.5150

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.33228    0.01939 -120.31   <2e-16 ***
emp.var.rate -0.56222    0.01018  -55.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28999  on 41187  degrees of freedom
Residual deviance: 25597  on 41186  degrees of freedom
AIC: 25601

Number of Fisher Scoring iterations: 5

We convert the target variable, y, to a factor (if it was not one already). We use the glm function provided by R for logistic regression. Because glm also fits other forms of regression, we specify family = "binomial" to make it fit a logistic regression. The formula passed as the first argument defines the dependent variable (left of ~) and the independent variable (right of ~).
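As a minimal sketch of the same pattern on data that can be run anywhere, the following uses a simulated data frame in place of the book's df. The names sim and rate, and the simulated relationship, are assumptions made only for illustration. It also shows which outcome the model describes: for a factor response, glm() treats the first level as failure and models the probability of the other level.

```r
# Simulated stand-in data; the book's df is NOT reproduced here.
set.seed(42)
n <- 1000
rate <- runif(n, min = -3, max = 1.5)   # hypothetical continuous predictor
p_yes <- plogis(-2 - 0.5 * rate)        # assumed true relationship
sim <- data.frame(
  rate = rate,
  y = ifelse(rbinom(n, size = 1, prob = p_yes) == 1, "yes", "no")
)

# Convert the outcome to a factor; levels sort alphabetically: "no", "yes".
sim$y <- factor(sim$y)
levels(sim$y)

# With a factor response, the first level ("no") is the failure case,
# so the fitted model describes the probability of "yes".
fit <- glm(y ~ rate, data = sim, family = "binomial")
coef(fit)
```

If the level of interest were not the second one, relevel() could be used to change the reference level before fitting.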

The output reports several results. We will ignore most of them for now and focus only on the Pr(>|z|) column of the coefficients table. This is the significance probability (p-value) for each coefficient: if the null hypothesis were true, the chance of observing a relationship at least this strong would be less than 2e-16. This is far below our 5% cut-off, so we reject the null hypothesis. Therefore, the target outcome has a statistically significant relationship with the employee variance rate and, because the estimated coefficient is negative (-0.56222), there is a higher chance of campaign conversion as the rate decreases.
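Rather than reading the p-value off the printed summary, it can be pulled out of the fitted model. The sketch below again uses simulated data (the names sim and x and the simulated effect size are assumptions, not the book's dataset) and relies on coef(summary(fit)), which returns the coefficients table as a matrix.

```r
# Simulated stand-in data; not the book's df.
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- factor(rbinom(n, size = 1, prob = plogis(-2 - 0.6 * x)),
            levels = c(0, 1), labels = c("no", "yes"))
sim <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = sim, family = "binomial")

# Coefficients table: columns Estimate, Std. Error, z value, Pr(>|z|).
coef_table <- coef(summary(fit))
p_value  <- coef_table["x", "Pr(>|z|)"]
estimate <- coef_table["x", "Estimate"]

p_value < 0.05   # TRUE means we reject the null hypothesis at the 5% level
exp(estimate)    # odds ratio: multiplicative change in odds per unit of x
```

An odds ratio below 1 corresponds to a negative coefficient, that is, the odds of the modeled outcome fall as the predictor increases.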

Similarly, let's repeat the same test for our second hypothesis. We define the following:

Null hypothesis: The campaign outcome has no relationship with the euro interest rate.
Alternate hypothesis: The campaign outcome has a relationship with the euro interest rate.

Exercise 37: Hypothesis 2 Testing for Categorical Dependent Variables and Continuous Independent Variables

Once again, we will use logistic regression to statistically test whether there is a relationship between the target variable, y, and the independent variable. In this exercise, we will use the euribor3m variable.

Perform the following steps:

Import the required libraries and create the DataFrame objects.
First, convert the dependent variable into a factor type:
df$y <- factor(df$y)
Next, perform logistic regression:
h.test2 <- glm(y ~ euribor3m, data = df, family = "binomial")
Print the test summary:
summary(h.test2)
The output is as follows:
Call:
glm(formula = y ~ euribor3m, family = "binomial", data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8568  -0.3730  -0.2997  -0.2917   2.5380

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.472940   0.027521  -17.18   <2e-16 ***
euribor3m   -0.536582   0.009547  -56.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28999  on 41187  degrees of freedom
Residual deviance: 25343  on 41186  degrees of freedom
AIC: 25347

Number of Fisher Scoring iterations: 5

Focusing exclusively on the Pr(>|z|) value for euribor3m in the previous output (again less than 2e-16), we can reject the null hypothesis and accept the alternate hypothesis. Therefore, the target outcome has a statistically significant relationship with the euro interest rate and, because the estimated coefficient is negative (-0.536582), there is a higher chance of campaign conversion as the rate decreases.
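The null and residual deviances printed in both outputs allow a cross-check that the book does not perform: a likelihood-ratio test, which compares the drop in deviance with a chi-squared distribution. The sketch below uses only the rounded figures copied from the two summaries, so the statistics are approximate; the variable names are illustrative.

```r
# Figures copied (as printed, rounded) from the two summary outputs.
null_dev  <- 28999
resid_dev <- c(emp.var.rate = 25597, euribor3m = 25343)
df_diff   <- 41187 - 41186   # one extra parameter in each fitted model

# Likelihood-ratio statistic: drop in deviance after adding the predictor.
lr_stat <- null_dev - resid_dev

# Upper-tail chi-squared probability on 1 degree of freedom.
p_values <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

lr_stat
p_values
p_values < 0.05
```

With a fitted model object available, anova(h.test2, test = "Chisq") runs the same kind of comparison directly on the unrounded deviances.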