Step-by-Step FAMD Tutorial with R and Python Examples

Factor Analysis of Mixed Data (FAMD) is a dimension-reduction technique designed for datasets containing both continuous and categorical variables. It blends ideas from Principal Component Analysis (PCA) for numeric data and Multiple Correspondence Analysis (MCA) for categorical data, producing factors that summarize the most important patterns across mixed-variable datasets. This tutorial walks through the theory, practical considerations, and step-by-step implementations in both R and Python with reproducible examples.
When to use FAMD
FAMD is appropriate when:
- Your dataset contains a mix of continuous (numeric) and categorical (nominal or ordinal) variables.
- You want to reduce dimensionality for visualization, clustering, or exploratory analysis.
- You need components that reflect joint structure across variable types rather than transforming all variables to a single type (e.g., via dummy encoding without taking scale into account).
Key advantage: FAMD balances continuous and categorical variables so that each variable contributes comparably to the analysis: continuous variables are centered and scaled, while categorical variables are encoded as indicator (dummy) variables and weighted so that each original categorical variable carries the same total weight as one continuous variable.
The idea behind FAMD (brief)
- Continuous variables: centered and scaled (like PCA).
- Categorical variables: converted to a set of binary indicator variables; each category is weighted inversely to its frequency (like MCA).
- The analysis finds orthogonal axes maximizing the explained variance across these transformed variables.
Mathematically, FAMD can be seen as performing a singular value decomposition (SVD) on a suitably standardized data matrix that combines continuous variables and dummy-coded categorical variables with appropriate weights.
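To make the weighting concrete, here is a minimal numpy/pandas sketch of that idea (a toy illustration only, with a made-up data frame; use FactoMineR or prince for real analyses, since scaling and coordinate conventions differ slightly between implementations):

```python
import numpy as np
import pandas as pd

# Toy mixed data frame (hypothetical values, just to illustrate the transform)
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 37, 29],
    "income": [30, 80, 55, 90, 60, 40],
    "city":   ["A", "B", "A", "C", "B", "A"],
})

num = df.select_dtypes(include="number")
cat = df.select_dtypes(exclude="number")

# Continuous part: center and scale, as in PCA
Z_num = (num - num.mean()) / num.std(ddof=0)

# Categorical part: dummy-code, weight each indicator by 1/sqrt(p_k)
# (p_k = proportion of rows in category k, as in MCA), then center
dummies = pd.get_dummies(cat).astype(float)
p = dummies.mean()
Z_cat = (dummies - p) / np.sqrt(p)

# Combined standardized matrix and its SVD (rows weighted uniformly by 1/n)
Z = pd.concat([Z_num, Z_cat], axis=1).to_numpy()
U, s, Vt = np.linalg.svd(Z / np.sqrt(len(df)), full_matrices=False)

eigenvalues = s ** 2                    # variance explained by each axis
row_coords = np.sqrt(len(df)) * U * s   # row coordinates on the factors
print(eigenvalues[:3])
```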
Preparing your data
- Handle missing values (impute or remove rows/columns).
- Ensure categorical variables are coded as factors (R) or as the pandas category dtype (Python).
- Decide whether to exclude low-frequency categories or combine levels (a minimal pandas sketch of these steps follows this list).
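A minimal pandas sketch of these preparation steps (the input file name and the 1% rarity threshold are hypothetical; adapt them to your data):

```python
import pandas as pd

df = pd.read_csv("mixed_data.csv")  # hypothetical input file

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# 1) Handle missing values (simple strategies shown; use more careful imputation if needed)
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna("missing")

# 2) Make sure categorical variables use the pandas category dtype
df[cat_cols] = df[cat_cols].astype("category")

# 3) Optionally lump levels rarer than 1% into an "other" category
for col in cat_cols:
    freq = df[col].value_counts(normalize=True)
    rare = freq[freq < 0.01].index
    if len(rare) > 0:
        df[col] = df[col].cat.add_categories("other")
        df[col] = df[col].where(~df[col].isin(rare), "other")
        df[col] = df[col].cat.remove_unused_categories()
```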
R: Step-by-step FAMD with FactoMineR and factoextra
Packages used:
- FactoMineR — implements FAMD.
- factoextra — visualization helpers.
- tidyverse — data manipulation (optional).
Install (if needed):
install.packages(c("FactoMineR","factoextra","tidyverse"))
1) Load libraries and example dataset

```r
library(FactoMineR)
library(factoextra)
library(tidyverse)

# Example: use the built-in "poison" dataset from FactoMineR, or create a toy dataset
data(iris)  # iris has only numeric variables; we'll add a categorical one
iris2 <- iris %>%
  mutate(Species = as.factor(Species),
         Size = cut(Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large")))
```

2) Run FAMD

```r
res.famd <- FAMD(iris2, ncp = 5, graph = FALSE)
```
3) Inspect results

```r
res.famd$eig               # eigenvalues (variance explained)
res.famd$quanti.var$coord  # coordinates of the continuous variables
res.famd$quali.var$coord   # coordinates of the categories of the categorical variables
res.famd$ind$coord         # individual (row) coordinates
```
4) Visualize

- Scree plot (eigenvalues):

```r
fviz_screeplot(res.famd, addlabels = TRUE)
```

- Variable plot (continuous variables and categories):

```r
fviz_famd_var(res.famd, repel = TRUE)
```

- Individuals colored by a categorical variable:

```r
fviz_famd_ind(res.famd, habillage = "Species", palette = "jco",
              addEllipses = TRUE, repel = TRUE)
```
5) Interpret components

- Look at variables with high contributions to each dimension:

```r
fviz_contrib(res.famd, choice = "var", axes = 1, top = 10)
```

- Examine category positions in the factor space to see how the levels relate to the components.
6) Use component scores for downstream tasks

```r
scores <- res.famd$ind$coord
# e.g., k-means clustering on the first three components
km <- kmeans(scores[, 1:3], centers = 3)
```
Python: Step-by-step FAMD with prince and scikit-learn
Packages used:
- prince — provides FAMD implementation.
- pandas, numpy, matplotlib, seaborn — utilities and plotting.
- scikit-learn — for downstream tasks (clustering, classifiers).
Install:
pip install pandas numpy matplotlib seaborn scikit-learn prince
1) Load libraries and prepare data

```python
import pandas as pd
import numpy as np
import prince
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Example: iris dataset, plus a binned categorical variable
iris_skl = load_iris(as_frame=True)
df = iris_skl.frame
df['Size'] = pd.cut(df['sepal length (cm)'], bins=3, labels=['Small', 'Medium', 'Large'])

# Ensure dtypes
df['Size'] = df['Size'].astype('category')
df['target'] = df['target'].astype('category')
```

2) Run FAMD

```python
famd = prince.FAMD(n_components=5, n_iter=3, copy=True, check_input=True, random_state=42)
famd = famd.fit(df)
```
Note: prince.FAMD expects all columns (numeric and categorical) in the DataFrame. It will automatically handle categorical dtypes.
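For a quick sanity check of how the columns will be treated (roughly, numeric dtypes go to the PCA-like part and category/object dtypes to the MCA-like part; this snippet is just an inspection aid, not part of the prince API):

```python
print(df.dtypes)
print("Categorical:", list(df.select_dtypes(include=["category", "object"]).columns))
print("Numeric:", list(df.select_dtypes(include="number").columns))
```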
3) Inspect results

- Eigenvalues / explained inertia:

```python
eigenvalues = famd.eigenvalues_
explained_inertia = famd.explained_inertia_
```

(Attribute names vary across prince versions; check the documentation for the version you have installed.)

- Row and column coordinates (factors):

```python
row_coords = famd.row_coordinates(df)
col_coords = famd.column_coordinates(df)
```
4) Visualize

- Scree plot:

```python
plt.plot(np.cumsum(explained_inertia))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained inertia')
plt.show()
```

- Scatter of the first two components:

```python
coords = row_coords.iloc[:, :2]
coords = coords.join(df['target'])
sns.scatterplot(data=coords, x=0, y=1, hue='target', palette='deep')
plt.xlabel('Dim 1'); plt.ylabel('Dim 2')
plt.show()
```
- Interpretation & downstream use
- Contributions: prince provides column contributions; for per-variable contributions you may need to aggregate category contributions for categorical variables.
col_contrib = famd.column_correlations(df) # approximate guidance; check prince docs
- Use row_coords as features for clustering or classification:
kmeans = KMeans(n_clusters=3, random_state=42).fit(row_coords.iloc[:, :3])
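The following helper is one hedged way to aggregate category-level rows back to their originating variable. It assumes you already have a DataFrame of per-column values (such as col_contrib above) whose index mixes numeric column names and category labels prefixed by the categorical variable's name; exact labels depend on the prince version, so adjust the matching rule as needed:

```python
import pandas as pd

def per_variable_contrib(col_contrib: pd.DataFrame, cat_vars: list) -> pd.DataFrame:
    """Sum category-level rows back to their originating categorical variable.

    Assumes numeric columns appear under their own name and category rows start
    with the categorical variable's name (naming conventions differ by prince version).
    """
    def variable_of(name):
        for var in cat_vars:
            if str(name).startswith(var):
                return var
        return name
    return col_contrib.groupby(variable_of).sum()

# Hypothetical usage with the table from the step above:
# per_variable_contrib(col_contrib, cat_vars=["Size", "target"])
```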
Practical tips and pitfalls
- Missing values: in R, impute first (e.g., with the missMDA package's imputeFAMD for mixed data) or remove incomplete rows; in Python, impute before running FAMD.
- Scaling: Continuous variables are standardized automatically in FAMD; do not scale again.
- Rare categories: Very rare levels can dominate MCA-type weighting; consider combining or removing infrequent categories.
- Interpretability: The direction of axes is arbitrary; focus on relative positions and variable contributions rather than sign.
- Number of components: use a scree plot and cumulative explained inertia; 2-4 components often suffice for visualization (a small cutoff sketch follows this list).
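As one way to apply the number-of-components tip, here is a small sketch that picks the smallest number of components reaching a cumulative-inertia threshold (the 80% cutoff is arbitrary; explained_inertia is assumed to be the array from the Python example above):

```python
import numpy as np

cum = np.cumsum(explained_inertia)  # cumulative explained inertia
n_keep = min(int(np.searchsorted(cum, 0.80)) + 1, len(cum))
print(f"Keep {n_keep} components ({cum[n_keep - 1]:.1%} cumulative inertia)")
```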
Quick comparison: R (FactoMineR) vs Python (prince)
| Feature | R (FactoMineR) | Python (prince) |
|---|---|---|
| Mature implementation | Yes | Less mature |
| Visualization helpers | factoextra | Manual with seaborn/matplotlib |
| Handling of missing data | Imputation via missMDA | Requires pre-imputation |
| Community examples/tutorials | Many | Fewer |
Example: Full reproducible R script
```r
# Full script: FAMD with iris2
library(FactoMineR); library(factoextra); library(tidyverse)

data(iris)
iris2 <- iris %>%
  mutate(Species = as.factor(Species),
         Size = cut(Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large")))

res.famd <- FAMD(iris2, ncp = 5, graph = FALSE)
print(res.famd$eig)
fviz_screeplot(res.famd, addlabels = TRUE)
fviz_famd_ind(res.famd, habillage = "Species", addEllipses = TRUE, repel = TRUE)
```
Example: Full reproducible Python script
```python
import pandas as pd
import numpy as np
import prince
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris(as_frame=True)
df = iris.frame
df['Size'] = pd.cut(df['sepal length (cm)'], bins=3, labels=['Small', 'Medium', 'Large'])
df['Size'] = df['Size'].astype('category')
df['target'] = df['target'].astype('category')

famd = prince.FAMD(n_components=5, random_state=42)
famd = famd.fit(df)

row_coords = famd.row_coordinates(df)
explained = famd.explained_inertia_

plt.plot(np.cumsum(explained)); plt.xlabel('n components'); plt.ylabel('cumulative inertia'); plt.show()
sns.scatterplot(data=row_coords.join(df['target']), x=0, y=1, hue='target'); plt.show()

kmeans = KMeans(n_clusters=3, random_state=42).fit(row_coords.iloc[:, :3])
```
Further reading and next steps
- Explore rotations (e.g., varimax) of the continuous loadings if interpretability is important.
- Use FAMD scores as features in supervised learning to reduce multicollinearity and noise (a brief sketch follows this list).
- For very large datasets, consider sampling or specialized scalable algorithms.
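As a sketch of the supervised-learning idea (continuing from the Python example's row_coords and df; the classifier and split below are illustrative choices, and in practice you would fit FAMD on the predictors only so the target does not leak into the factors):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = row_coords.iloc[:, :3]  # first three FAMD components as features
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
```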
This tutorial covered the essentials: intuition, data preparation, R and Python implementations, visualization, interpretation, and practical tips.