Inferential Statistics: Making Inferences with Data in R

A guide for hypothesis testing and regression analysis in R

May 28, 2023

Data Science, R, R Programming, Inferential statistics

Kelvin Mwaka Muia

Introduction to Inferential Statistics using R.

Inferential statistics is a branch of statistics that allows us to draw conclusions about a population based on data collected from a sample. It also lets us quantify the uncertainty in those conclusions, that is, how far an estimate computed from a sample might be from the true population value. In this blog post, we will explore the key concepts of inferential statistics and discuss some common techniques for making inferences.

Before conducting inferential statistical analysis, it is important to consider the assumptions specific to each test, such as independence of observations or, for some tests, approximate normality.

In R, one of many options for conducting inferential tests is the infer package. It expresses a test as a short pipeline of verbs: specify(), hypothesize(), generate(), and calculate().

Interpreting the results of inferential tests involves several factors. Firstly, the p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A low p-value (typically \(p < 0.05\)) suggests statistically significant test results.
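
To make the definition concrete, here is a minimal base R sketch that computes a simulated two-sided p-value by hand. The coin-flip setup (20 flips, 15 heads) is a hypothetical example chosen for illustration and is not part of the Arthritis analysis.

```r
set.seed(123)  # for reproducible simulation

# Hypothetical data: 15 heads observed in 20 flips
n_flips <- 20
observed_heads <- 15

# Simulate the number of heads under the null hypothesis of a fair coin
null_heads <- rbinom(10000, size = n_flips, prob = 0.5)

# Two-sided p-value: share of simulated results at least as far from
# the expected count (10) as the observed count
expected_heads <- n_flips * 0.5
p_value <- mean(abs(null_heads - expected_heads) >=
                  abs(observed_heads - expected_heads))
p_value
```

The infer pipelines used later in this post follow the same logic: simulate the statistic under the null hypothesis, then measure how unusual the observed statistic is.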

Secondly, we consider the effect size, which quantifies the strength of the relationship between variables. A large effect size indicates a strong relationship, while a small effect size suggests a weak relationship.

Lastly, we examine the confidence interval, a range of plausible values for the true parameter at a stated confidence level (for example, 95%). A narrow confidence interval signifies a more precise estimate, while a wide interval reflects greater uncertainty.
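
As an illustration, the sketch below computes a 95% bootstrap percentile interval for the mean of Age with infer, using the Arthritis data described in the Dataset section. The choice of a percentile interval and of 1000 resamples is an assumption made for this example; other interval types are available.

```r
library(infer)
library(vcd)   # provides the Arthritis data frame
data(Arthritis)

set.seed(123)  # for reproducible resampling

# Bootstrap distribution of the sample mean of Age
boot_dist <- Arthritis |>
  specify(response = Age) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# 95% percentile interval: the middle 95% of the bootstrap means
age_ci <- boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")
age_ci
```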

The examples below demonstrate hypothesis testing and regression analysis in R. Each example highlights the steps, results, and their implications.

Dataset.

The dataset used in all sections is the Arthritis data frame, which ships with the vcd package. The package describes it as: “Data from Koch & Edwards (1988) from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis.”

The descriptive statistics and a visualization of the data are shown below:

# clear workspace
rm(list = ls())
library(dplyr)
library(infer)
# load the vcd package (provides the Arthritis data and mosaic())
library(vcd)
# load the Arthritis dataset (a data frame)
data(Arthritis)
# cross-tabulate improvement by treatment and sex
arthritis_table <- xtabs(~Improved + Treatment + Sex, Arthritis)
ftable(arthritis_table)
##                    Sex Female Male
## Improved Treatment                
## None     Placebo           19   10
##          Treated            6    7
## Some     Placebo            7    0
##          Treated            5    2
## Marked   Placebo            6    1
##          Treated           16    5

From the cross table of the Arthritis dataset, we can make several observations. The table shows the distribution of individuals by their sex (Female and Male) and the improvement in their condition (None, Some, or Marked), split by the treatment they received (Placebo or Treated).

The majority of individuals in the dataset are female, as indicated by the higher counts in the “Female” column compared to the “Male” column. Among the placebo patients who showed no improvement (None), there were more females (19) than males (10).

For the individuals who received the placebo, there were more females than males in all three categories of improvement (None, Some, and Marked). The difference is particularly notable in the “Some” category, where there were 7 females and no males.

In the “Treated” group, the counts vary across the different levels of improvement. In the “None” category, there were 6 females and 7 males. In the “Some” category, there were 5 females and 2 males. In the “Marked” category, there were 16 females and 5 males.
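
The raw counts are easier to compare as proportions. The sketch below collapses the table over Sex, computes the share of each improvement level within each treatment group, and reports Cramér's V as one possible effect size for the association. Collapsing over Sex is a simplification made for this example.

```r
library(vcd)   # provides the Arthritis data and assocstats()
data(Arthritis)

# Two-way table: treatment by improvement, collapsing over sex
treat_by_improved <- xtabs(~ Treatment + Improved, data = Arthritis)

# Row proportions: share of each improvement level within each treatment
round(prop.table(treat_by_improved, margin = 1), 2)

# Cramer's V, an effect size for association in a contingency table
cramers_v <- assocstats(treat_by_improved)$cramer
cramers_v
```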

#visualize arthritis data using a mosaic plot
mosaic(arthritis_table, shade = TRUE, legend = TRUE, 
       labeling_args = list(set_varnames = c(Sex = "Sex", 
                                             Improved = "Improved", 
                                             Treatment = "Treatment")),
       set_labels = list(Improved = c("None", "Some", "Marked"),
                         Treatment = c("Placebo", "Treated"),
                         Sex = c("F", "M")),
       main = "Arthritis data")

Hypothesis testing.

Hypothesis testing assesses whether the observed data are consistent with a null hypothesis, for example that a population mean equals a given value or that two groups do not differ.

Statistical inference on the `Age` numerical variable (Mean).

Computing the observed mean of the Age variable:

obs_age_mean <- Arthritis |>
  observe(response = Age, stat = "mean")
obs_age_mean
## Response: Age (numeric)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  53.4

Then, generating the null distribution under the null hypothesis that the mean age is 50, and visualizing the observed statistic alongside it:

set.seed(123) #for reproducible results
null_dist <- Arthritis |>
  specify(response = Age) |>
  hypothesize(null = "point", mu = 50) |>
  generate(reps = 1000) |>
  calculate(stat = "mean")
visualize(null_dist) +
  shade_p_value(obs_stat = obs_age_mean, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic:

null_dist |>
  get_p_value(obs_stat = obs_age_mean, direction = "two-sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.014
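
As a cross-check, the same hypothesis can be tested with a classical one-sample t-test from base R. This is a theory-based counterpart to the simulation above, not the method used in this post, and it additionally assumes that the sample mean is approximately normally distributed.

```r
library(vcd)   # provides the Arthritis data frame
data(Arthritis)

# One-sample t-test of H0: mean Age = 50 against a two-sided alternative
age_t <- t.test(Arthritis$Age, mu = 50, alternative = "two.sided")

age_t$estimate   # sample mean of Age
age_t$p.value    # theory-based p-value
age_t$conf.int   # 95% confidence interval for the mean
```

The two p-values will not match exactly, since one comes from resampling and the other from the t distribution, but they should lead to the same conclusion.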

Statistical inference on the `Sex` categorical variable (Proportion).

Computing the observed statistic, using the observe() wrapper:

prop_g <- Arthritis |>
  observe(response = Sex, 
          success = "Female", stat = "prop")
prop_g
## Response: Sex (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.702

Then, generating the null distribution under the null hypothesis that the proportion of females is 0.5, and visualizing the observed statistic alongside it:

set.seed(123)#for reproducibility
null_dist <- Arthritis |>
  specify(response = Sex, success = "Female") |>
  hypothesize(null = "point", p = .5) |>
  generate(reps = 1000) |>
  calculate(stat = "prop")
visualize(null_dist) +
  shade_p_value(obs_stat = prop_g, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic:

null_dist |>
  get_p_value(obs_stat = prop_g, direction = "two-sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
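
A simulated p-value of 0 only means that none of the 1000 simulated proportions was as extreme as the observed one, so it is better reported as p < 0.001 than as exactly zero. As a hedged cross-check, the sketch below runs an exact binomial test in base R, with the counts taken from the cross table in the Dataset section.

```r
# Counts read off the cross table above
n_female <- 19 + 6 + 7 + 5 + 6 + 16   # 59 females
n_male   <- 10 + 7 + 0 + 2 + 1 + 5    # 25 males
n_total  <- n_female + n_male         # 84 patients

# Exact binomial test of H0: proportion of females = 0.5
exact_test <- binom.test(n_female, n_total, p = 0.5,
                         alternative = "two.sided")
exact_test$estimate   # observed proportion of females
exact_test$p.value    # small, but not exactly zero
```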

Sex and Treatment categorical variables (difference in proportions test).

Here, we test whether the proportion of patients in the Treated group differs between females and males.

Computing the observed statistic using the observe() wrapper:

diff_in_g_props <- Arthritis |> 
  observe(Treatment ~ Sex, success = "Treated", 
          stat = "diff in props", order = c("Female", "Male"))
diff_in_g_props
## Response: Treatment (factor)
## Explanatory: Sex (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 -0.102

Then, generating the null distribution under the null hypothesis that Treatment and Sex are independent, and visualizing the observed statistic alongside it:

set.seed(123)
null_dist <- Arthritis |>
  specify(Treatment ~ Sex, success = "Treated") |>
  hypothesize(null = "independence") |> 
  generate(reps = 1000) |> 
  calculate(stat = "diff in props", order = c("Female", "Male"))
visualize(null_dist) +
  shade_p_value(obs_stat = diff_in_g_props, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic:

null_dist |>
  get_p_value(obs_stat = diff_in_g_props, direction = "two-sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.538
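
The observed difference can be reproduced by hand from the cross table, and a theory-based two-sample test of proportions gives a point of comparison. The sketch below uses base R's prop.test() without continuity correction; that setting is a choice made for this example.

```r
# Treated patients and group sizes, read off the cross table above
treated <- c(Female = 6 + 5 + 16, Male = 7 + 2 + 5)   # 27 and 14
totals  <- c(Female = 59, Male = 25)

# Observed difference in proportions (Female minus Male)
obs_diff <- unname(treated["Female"] / totals["Female"] -
                     treated["Male"] / totals["Male"])
obs_diff

# Two-sample test of equal proportions
prop_result <- prop.test(x = treated, n = totals, correct = FALSE)
prop_result$p.value
```

The permutation and theory-based p-values can differ noticeably in small samples, but here both point to no evidence of a difference.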

Statistical inference on Age and Sex (difference in grouped means).

Calculating the observed statistic using the observe() wrapper:

diff_in_age_sex_mean <- Arthritis |> 
  observe(Age ~ Sex,
          stat = "diff in means", order = c("Female", "Male"))

Then, generating the null distribution under the null hypothesis that Age and Sex are independent, and visualizing the observed statistic alongside it:

null_dist <- Arthritis |>
  specify(Age ~ Sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("Female", "Male"))
visualize(null_dist) +
  shade_p_value(obs_stat = diff_in_age_sex_mean, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic:

null_dist |>
  get_p_value(obs_stat = diff_in_age_sex_mean, direction = "two-sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1     0.9
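
The observed difference is not printed above, so the sketch below computes it directly and compares the permutation result with a Welch two-sample t-test from base R. The Welch test is a theory-based alternative offered for comparison, not the method used in this post.

```r
library(vcd)   # provides the Arthritis data frame
data(Arthritis)

# Mean age in each group and their difference (Female minus Male)
group_means <- tapply(Arthritis$Age, Arthritis$Sex, mean)
obs_diff <- unname(group_means["Female"] - group_means["Male"])
obs_diff

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(Age ~ Sex, data = Arthritis)
welch$p.value
welch$conf.int   # 95% confidence interval for the difference in means
```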

Regression analysis.

This method is used when we need to predict the value of one variable based on one or more others.

OLS model on the `Improved` variable.

library(ExPanDaR)
library(htmltools)
fixed_model <- prepare_regression_table(Arthritis,
                                        dvs = "Improved",
                                        idvs = c("Sex", "Treatment", "Age"),
                                        models = "ols")
HTML(fixed_model$table)
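
Improved is an ordered factor (None < Some < Marked), while OLS expects a numeric response. One way to see what such a model estimates is the base R sketch below, which scores the three levels as 1, 2, and 3 before fitting. Treating the levels as equally spaced is an assumption made for illustration, and the variable name improved_score is hypothetical; an ordinal model may be more appropriate for this kind of outcome.

```r
library(vcd)   # provides the Arthritis data frame
data(Arthritis)

# Score the ordered factor as 1 (None), 2 (Some), 3 (Marked);
# the equal spacing between levels is an assumption
arthritis_scored <- transform(Arthritis,
                              improved_score = as.numeric(Improved))

# OLS fit of the numeric score on sex, treatment, and age
ols_fit <- lm(improved_score ~ Sex + Treatment + Age,
              data = arthritis_scored)
summary(ols_fit)
```

In this sketch, the coefficient labelled TreatmentTreated estimates the average change in the improvement score for treated patients relative to placebo, holding sex and age fixed.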

Conclusion.

Inferential statistics is a powerful tool for drawing conclusions about populations based on samples. It requires careful consideration of assumptions, accurate test implementation, and correct result interpretation.

Watch out for my next posts about correlation analysis, confidence intervals and regression analysis!
