--- title: "**levelSets** Example: Classifiers" author: "Richard Raubertas" date: "03 February 2026" output: pdf_document: toc: true toc_depth: 3 number_sections: true html_document: toc: true toc_depth: 3 number_sections: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Classifiers} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set(echo=TRUE, fig.align="center") has_ggplot <- requireNamespace("ggplot2", quietly=TRUE) has_rf <- requireNamespace("randomForest", quietly=TRUE) has_peng <- requireNamespace("palmerpenguins", quietly=TRUE) ``` ```{r, echo=FALSE, eval=has_ggplot} ggplot2::theme_update(plot.background=ggplot2::element_rect(fill=NA)) ``` # Introduction A classifier is a rule that uses one or more features of an item to assign the item to one of a set of $K$ possible classes or groups. A classifier effectively partitions the space of possible feature vectors $x$ into regions, where each region is associated with one class: an item whose feature vector belongs to a region associated with class $k$ is assigned to that class. We may want to understand and visualize those regions of feature space. The output of many classifiers takes the form of a vector $p = (p_1, p_2, \ldots, p_K)$ where $p_k$ is the classifier's reported probability that the item belongs to class $k$. (Less refined classifiers may discretize those probabilities to 1 for a single class and 0 for all the others.) If we focus on a single class $k$ of interest, its regions can be interpreted as a level set, namely the set of $x$ vectors in $d$-dimensional feature space for which $p_k$ is greater than or equal to some threshold probability. To illustrate this application of level sets, we use a random forest classifier [Breiman 2001, Liaw and Wiener 2002] to classify penguins into one of three species based on physical characteristics. The data set is from the `palmerpenguins` package [Horst et al 2020] and is described in `?palmerpenguins::penguins`. There are 333 penguins with complete data, from three species: 146 Adelie (44\%), 68 Chinstrap (20\%), and 119 Gentoo (36\%). For each animal there are seven features, of which two (`island` and `year`) pertain to where and when the animal was measured, and the other five are attributes of the animal itself. Here the modeling focuses on those five, and on discriminating Adelie penguins from the other two species. ```{r, eval=TRUE, echo=FALSE} if (!has_rf) { message("Can't create vignette 'levelSets Example: Classifiers' because ", "package 'randomForest' is not available") knitr::knit_exit() } else if (!has_peng) { message("Can't create vignette 'levelSets Example: Classifiers' because ", "package 'palmerpenguins' is not available") knitr::knit_exit() } ``` # Problem setup ## Develop the random forest classifier ```{r} penguin_data <- stats::na.omit(as.data.frame(palmerpenguins::penguins)) # Shorter names for feature (input) space dimensions: names(penguin_data) <- sub("_length_mm", "_len", names(penguin_data)) names(penguin_data) <- sub("_depth_mm", "_dep", names(penguin_data)) names(penguin_data) <- sub("body_mass_g", "weight", names(penguin_data)) penguin_data$sex <- ifelse(penguin_data$sex == "female", 1, 2) # now numeric # Features to use for modeling: pvars <- c("bill_len", "bill_dep", "flipper_len", "weight", "sex") set.seed(1) rf1 <- randomForest::randomForest(penguin_data[, pvars], y=penguin_data[, "species"], importance=TRUE) print(rf1) randomForest::varImpPlot(rf1) ``` The classifier appears to very accurate, since the estimated (OOB, "out-of-bag") error rate is only 1.8\%. `bill_len` is the most important predictor, `sex` the least. ## Specify the response function and input space The response function in this case is essentially a wrapper for the `predict` method for `randomForest` objects. Since the class (species) of interest is Adelie, we are interested in the level set where the reported probability of being that species is high. ```{r} respfn <- function(x, model) { if (nrow(x) > 0) { predprobs <- predict(model, newdata=x, type="prob") # 3-column matrix # of class probabilities unname(predprobs[, "Adelie"]) } else numeric(0) # 'predict.randomForest' does not like empty 'newdata' } ``` We don't want to extrapolate the classifier model beyond the range of the data. So set the feasible region of input space to be the bounding box of the input data. Also, random forest models are inherently discontinuous functions of the inputs, so turn off attempts to use derivatives. ```{r} feasbnds <- apply(data.matrix(penguin_data[, pvars]), 2, range) library(levelSets) fobj <- fnObj(pvars, respfn=respfn, feasbnds=feasbnds, model=rf1, derivmethod="none") d <- inpDim(fobj) # 5 ``` ## Search region and input scaling The level set search region is set to be a slightly expanded version of the feasible region. (Expansion is recommended because it allows the search to clearly identify points where the level set touches the boundary of the feasible region. This avoids ambiguous "censored" boundary points.) Also set the scaling of inputs to reflect the differing ranges of the features. ```{r} rect0 <- boundingRect(feasbnds, expand=1.1, spec=fobj) scl0 <- inpScale(feasbnds[2, ] - feasbnds[1, ], spec=fobj) ``` ## Define the level set of interest We want to identify regions of input space where the reported probability of being an Adelie penguin is high. (Note that their proportion in the training data is 0.44.) ```{r} thresh <- 0.75 ``` # Profiling the response function along some rays First we examine profiles of the reported Adelie probability along rays parallel to each coordinate axis. The ray origin (position 0) is the reference point of 'rect0', namely its center. Position 1 along each ray is its intersection with the boundary of 'rect0'. ```{r} rays0 <- axisRays(fobj, rect=rect0, scale=scl0) profs <- respProfiles(fobj, rays=rays0, posns=c(-10:10)/10, lsetthresh=thresh) summary(profs) ``` ```{r, eval=requireNamespace("ggplot2", quietly=TRUE)} plot(profs, groupBy=inpNames(fobj)) # 'groupBy' puts each ray in a separate panel ``` ```{r, eval=!requireNamespace("ggplot2", quietly=TRUE), echo=FALSE} message("'ggplot2' package for plotting 'respProfiles' is not available") ``` Short `bill_len` is strongly associated with a high probability of species being Adelie. # Identifying the high-probability Adelie region of feature space We use the `bdrySearch()` function, and select the initial ray origin from among the profile points in `profs` that belong to the level set. ```{r} profinfo <- respInfo(profs) # data frame, including column 'lset' print(subset(profinfo, lset)) ``` All nine level set points are from the `bill_len` profile (`line = 1`). We choose point 8 in `profs` because it is reasonably close to the center of `rect` (`posn` not too far from the ray origin at position 0) but not actually on the boundary of the level set. ```{r} srch1 <- bdrySearch(fobj, lsetthresh=thresh, rect=rect0, scale=scl0, initOrigin=ptCoord(profs)[8, ]) segs1 <- lsetSegs(srch1) ``` To gain some confidence that the line segments in `segs1` do in fact lie entirely in the level set, and that the level set does not have multiple parts, evaluate the response function at a number of points within each segment: ```{r} chk <- lsetSegsCheck(segs1, fnobj=fobj, posns=(1:4)/5) table(chk$validIn) # all TRUE ``` Summarize and plot the segments: ```{r} summary(segs1) pairs(segs1, rect=rect0, main=paste0("Segments in feature space with ", "Pr(Adelie) >= ", thresh)) ``` A level set segment was found on each of the 125 rays. Some things to note: - Most segment endpoints (179 of 250) are on the boundary of the feasible region. - The level set is approximately rectangular when projected onto any pair of `bill_len`, `bill_dep`, and `flipper_len`. - There are no obvious patterns in the level set with respect to `weight`. - This package requires all inputs to the response function to be numeric, so the feature `sex` was coded as 1 (female) or 2 (male) when developing the classifier `rf1`. There are segments that end at values between 1 and 2, which doesn't make sense for classifying penguins. A way to address this is to slice the level set analysis, as shown in the next section. # Slicing the level set Since `sex` is a categorical feature it is logical to examine the level set separately for females and males. We will also slice `weight` at three values (low/medium/high = 3000/4500/6000 grams). And the boundary search will build on the results already found so far (`currentBdry=srch1`). Control option `axisDegree` is set to 2, rather than the default of 1, in order to increase the number of rays per origin at each search step. (See `?srchControl`.) ```{r} slices2 <- sliceMat(list(sex=c(1, 2), weight=c(3000, 4500, 6000)), spec=fobj) # 2 * 3 = 6 slices srch2 <- slicedBdrySearch(fobj, lsetthresh=thresh, rect=rect0, scale=scl0, slices=slices2, currentBdry=srch1, control=list(axisDegree=2)) ``` The result `srch2` is a list with one component per slice. Each component is itself a list like the one returned by `bdrySearch()`. The function `plotSegsList()` can plot the results, putting each slice into a separate panel (package `ggplot2` is required). ```{r} if (requireNamespace("ggplot2", quietly=TRUE)) { plt <- plotSegsList(srch2, dims=c("bill_len", "bill_dep"), rect=feasbnds, wrap=list(ncol=2)) print(plt + ggplot2::labs(title=paste0("Pr(Adelie) >= ", thresh, " (bill_dep vs bill_len)"))) plt <- plotSegsList(srch2, dims=c("bill_len", "flipper_len"), rect=feasbnds, wrap=list(ncol=2)) print(plt + ggplot2::labs(title=paste0("Pr(Adelie) >= ", thresh, " (flipper_len vs bill_len)"))) plt <- plotSegsList(srch2, dims=c("bill_dep", "flipper_len"), rect=feasbnds, wrap=list(ncol=2)) print(plt + ggplot2::labs(title=paste0("Pr(Adelie) >= ", thresh, " (flipper_len vs bill_dep)"))) } ``` The level set has corners, which is not surprising for a tree-based classifier. Sex has only small effect on the level set, as expected from the variable importance measures shown earlier. Weight has a noticable effect, with the level set extending to small values of `bill_dep` for animals with the smallest weights. # References {-} Breiman, L (2001). Random forests. _Machine Learning_ *45*(1), 5-32. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218. Liaw A, Wiener M (2002). Classification and regression by randomForest. _R News_, *2*(3), 18-22. .