Regression with a Categorical Independent Variable
Regression
GLM
Incorporating categorical predictors (i.e., factor variables) into a regression model using dummy variables.
General Principles
To study the relationship between a categorical independent variable and a continuous dependent variable, we use a model that stratifies by category.
Stratification models how each of the k categories of the independent variable affects the continuous target variable by estimating a separate regression coefficient for each category. To implement stratification, categorical variables are typically encoded using one-hot encoding or by converting the categories to integer indices, as sketched below.
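For illustration, here is a minimal sketch of these two encodings using pandas (pandas and the data frame here are illustrative and not required by the BI package; the clade labels follow the milk data used in the example below):

import pandas as pd

df = pd.DataFrame({'clade': ['Ape', 'New World Monkey', 'Ape', 'Strepsirrhine']})

# Index encoding: map each category label to an integer code 0..k-1
df['index_clade'] = df['clade'].astype('category').cat.codes

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(df['clade'], prefix='clade')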
Because we estimate a regression coefficient for each of the k categories, the prior must be declared with a shape equal to the number of categories k (see the comments in the code below).
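Rather than hard-coding that number, the shape can be derived from the data. A minimal sketch, assuming index_clade is the integer-encoded category array and reusing the m.dist.normal call shown in the Example section:

import numpy as np

n_clades = np.unique(index_clade).size  # number of distinct categories (4 in the milk data)
a = m.dist.normal(0, 0.5, shape=(n_clades,), name='a')  # one coefficient per category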
To compare two categories, we must compute the distribution of the difference between their coefficients, known as the contrast distribution; never compare confidence intervals or p-values directly. A sketch of this computation follows.
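A minimal sketch, assuming posterior samples have already been drawn and collected in a dictionary-like object named posterior (a hypothetical name) whose 'a' entry holds the per-category coefficients with shape (n_samples, 4), matching the a parameter in the example below:

import numpy as np

a_samples = posterior['a']  # posterior draws of the per-category coefficients

# Contrast distribution: per-draw difference between clade 0 and clade 1
contrast = a_samples[:, 0] - a_samples[:, 1]

print(contrast.mean())                       # posterior mean difference
print(np.percentile(contrast, [2.5, 97.5]))  # 95% credible interval of the contrast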
Example
Below is example code demonstrating Bayesian regression with a categorical independent variable using the Bayesian Inference (BI) package. The data consist of a continuous dependent variable (kcal_per_g), the caloric content of milk per gram; a categorical independent variable (index_clade), species clade membership; and a continuous independent variable (mass), the body mass of individuals in the clade. The goal is to estimate the differences in milk calories between clades. This example is based on McElreath (2018).
from BI import bi

# Setup device ------------------------------------------------
m = bi(platform='cpu')

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.milk(only_path=True)
m.data(data_path, sep=';')
m.index(["clade"])       # Convert clade names into an index
m.scale(['kcal_per_g'])  # Scale

# Define model ------------------------------------------------
def model(kcal_per_g, index_clade, mass):
    a = m.dist.normal(0, 0.5, shape=(4,), name='a')  # shape based on the number of clades
    b = m.dist.normal(0, 0.5, shape=(4,), name='b')
    s = m.dist.exponential(1, name='s')
    mu = a[index_clade] + b[index_clade] * mass
    m.dist.normal(mu, s, obs=kcal_per_g)  # Likelihood

# Run mcmc ------------------------------------------------
m.fit(model)  # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m.summary()  # Get posterior distributions
library(BayesianInference)

# Setup device ------------------------------------------------
m = importBI(platform='cpu')

# Import Data & Data Manipulation ------------------------------------------------
m$data(m$load$milk(only_path = T), sep=';')  # Load csv file
m$scale(list('kcal.per.g'))                  # Scale
m$index(list('clade'))                       # Convert clade names into an index
m$data_to_model(list('kcal_per_g', 'index_clade'))  # Send to model (convert to jax array)

# Define model ------------------------------------------------
model <- function(kcal_per_g, index_clade){
  # Parameter prior distributions
  beta  = bi.dist.normal(0, 0.5, name='beta', shape=c(4))  # shape based on the number of clades
  sigma = bi.dist.exponential(1, name='s')
  # Likelihood
  bi.dist.normal(beta[index_clade], sigma, obs=kcal_per_g)
}

# Run mcmc ------------------------------------------------
m$fit(model)  # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m$summary()  # Get posterior distributions
using BayesianInference

# Setup device ------------------------------------------------
m = importBI(platform="cpu")

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.milk(only_path=true)
m.data(data_path, sep=';')
m.index("clade")        # Convert clade names into an index
m.scale(["kcal_per_g"]) # Scale

# Define model ------------------------------------------------
@BI function model(kcal_per_g, index_clade, mass)
    a = m.dist.normal(0, 0.5, shape=(4,), name="a")  # shape based on the number of clades
    b = m.dist.normal(0, 0.5, shape=(4,), name="b")
    s = m.dist.exponential(1, name="s")
    mu = a[index_clade] + b[index_clade] * mass
    m.dist.normal(mu, s, obs=kcal_per_g)  # Likelihood
end

# Run mcmc ------------------------------------------------
m.fit(model)  # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m.summary()  # Get posterior distributions
Caution
For R users: when working with indices, you must ensure 1) that the indices are integers (e.g., as.integer(index_clade)) and 2) that the indices start at 0 rather than 1 (e.g., as.integer(index_clade) - 1).
Mathematical Details
Frequentist formulation
We model the relationship between the categorical input feature (X) and the target variable (Y) using the following equation:
Y_i = \alpha + \beta_k X_i + \epsilon_i
Where:
Y_i is the dependent variable for observation i.
\alpha is the intercept term.
\beta_k are the regression coefficients for each k category.
X_i is the encoded categorical input variable for observation i.
\epsilon_i is the error term for observation i.
We can interpret \beta_k as the effect of each category on Y relative to the baseline (usually one of the categories or the intercept).
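To make this concrete: under one-hot encoding, X_i expands into one 0/1 indicator per category. For an observation i in category 2 (with category 1 as the baseline, one common convention), the equation reduces to

Y_i = \alpha + \beta_2 + \epsilon_i

so \beta_2 is the expected difference in Y between category 2 and the baseline.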
Bayesian formulation
In the Bayesian formulation, we define a prior for each parameter. We can express the Bayesian regression model, accounting for the prior distributions, as follows:
Y_i \sim \text{Normal}(\alpha + \beta_k X_i, \sigma)

\alpha \sim \text{Normal}(0, 1)

\beta_k \sim \text{Normal}(0, 1), \quad k = 1, \dots, K

\sigma \sim \text{Exponential}(1)
Where:
Y_i is the dependent variable for observation i.
\alpha is the intercept term, which in this case has a unit-normal prior.
\beta_k are the slope coefficients for the K distinct categories of the independent variable, which also have unit-normal priors.
X_i is the encoded categorical input variable for observation i.
\sigma is a standard deviation parameter, which here has an Exponential prior that constrains it to be positive.
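The contrast distribution described in the General Principles section then follows directly: to compare categories j and j', we examine the posterior distribution of the difference

\delta_{jj'} = \beta_j - \beta_{j'}

which is computed by subtracting the corresponding posterior samples draw by draw.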