---
title: "**levelSets** Example:  Classifiers"
author: "Richard Raubertas"
date: "03 February 2026"
output: 
  pdf_document: 
    toc: true
    toc_depth: 3
    number_sections: true
  html_document:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Classifiers}
  %\VignetteEncoding{UTF-8}
---

<!-- Vignette 4 for 'levelSets' package -->

<!-- Global options -->
```{r, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, fig.align="center")
has_ggplot <- requireNamespace("ggplot2", quietly=TRUE)
has_rf <- requireNamespace("randomForest", quietly=TRUE)
has_peng <- requireNamespace("palmerpenguins", quietly=TRUE)
```

```{r, echo=FALSE, eval=has_ggplot}
ggplot2::theme_update(plot.background=ggplot2::element_rect(fill=NA))
```

# Introduction

A classifier is a rule that uses one or more features of an item to assign 
the item to one of a set of $K$ possible classes or groups.  A classifier 
effectively 
partitions the space of possible feature vectors $x$ into regions, where each 
region is associated with one class:  an item whose feature vector 
belongs to a region associated with class $k$ is assigned to that class. 
We may want to understand and visualize those regions of feature space.

The output of many classifiers takes the form of a vector $p = 
(p_1, p_2, \ldots, p_K)$ where $p_k$ is the classifier's reported 
probability that the item belongs to class $k$.  (Less refined classifiers 
may discretize those probabilities to 1 for a single class and 0 for 
all the others.)  If 
we focus on a single class $k$ of interest, its regions can be interpreted 
as a level set, namely the set of $x$ vectors in $d$-dimensional feature 
space for which $p_k$ is greater than or equal to some threshold 
probability.

To illustrate this application of level sets, we use a random forest 
classifier [Breiman 2001, Liaw and Wiener 2002] to classify penguins into 
one of three species based on physical characteristics. 
The data set is from the `palmerpenguins` package [Horst et al 2020] and 
is described in 
`?palmerpenguins::penguins`.  There are 333 penguins with complete data, 
from three species: 146 Adelie (44\%), 68 Chinstrap (20\%), and 119 Gentoo 
(36\%).  For each animal there are seven features, of which two (`island` 
and `year`) pertain to where and when the animal was measured, and the 
other five are attributes of the animal itself.  Here the modeling focuses 
on those five, and on discriminating Adelie penguins from the other two 
species. 

```{r, eval=TRUE, echo=FALSE}
if (!has_rf) {
  message("Can't create vignette 'levelSets Example:  Classifiers' because ", 
          "package 'randomForest' is not available")
  knitr::knit_exit()
} else if (!has_peng) {
  message("Can't create vignette 'levelSets Example:  Classifiers' because ", 
          "package 'palmerpenguins' is not available")
  knitr::knit_exit()
}
```

# Problem setup

## Develop the random forest classifier

```{r}
penguin_data <- stats::na.omit(as.data.frame(palmerpenguins::penguins))
# Shorter names for feature (input) space dimensions:
names(penguin_data) <- sub("_length_mm", "_len", names(penguin_data))
names(penguin_data) <- sub("_depth_mm", "_dep", names(penguin_data))
names(penguin_data) <- sub("body_mass_g", "weight", names(penguin_data))
penguin_data$sex <- ifelse(penguin_data$sex == "female", 1, 2)  # now numeric
# Features to use for modeling:
pvars <- c("bill_len", "bill_dep", "flipper_len", "weight", "sex")

set.seed(1)
rf1 <- randomForest::randomForest(penguin_data[, pvars], 
                                  y=penguin_data[, "species"], 
                                  importance=TRUE)
print(rf1)
randomForest::varImpPlot(rf1)
```

The classifier appears to very accurate, since the 
estimated (OOB, "out-of-bag") error rate is only 1.8\%. 
`bill_len` is the most important predictor, `sex` the least.

## Specify the response function and input space

The response function in this case is essentially a wrapper for the 
`predict` method for `randomForest` objects.  Since the class 
(species) of interest is Adelie, we are interested in the level set 
where the reported probability of being that species is high.

```{r}
respfn <- function(x, model) {
  if (nrow(x) > 0) {
    predprobs <- predict(model, newdata=x, type="prob")  # 3-column matrix
                                              # of class probabilities
    unname(predprobs[, "Adelie"])
  } else  numeric(0)  # 'predict.randomForest' does not like empty 'newdata'
}
```

We don't want to extrapolate the classifier model beyond the range of 
the data.  So set the feasible region of input space to be the bounding 
box of the input data.  Also, random forest models are inherently 
discontinuous functions of the inputs, so turn off attempts to use 
derivatives.

```{r}
feasbnds <- apply(data.matrix(penguin_data[, pvars]), 2, range)
library(levelSets)
fobj <- fnObj(pvars, respfn=respfn, feasbnds=feasbnds, model=rf1, 
              derivmethod="none")
d <- inpDim(fobj)  # 5
```

## Search region and input scaling

The level set search region is set to be a slightly expanded version of the 
feasible region.  (Expansion is recommended because it allows the search 
to clearly identify points where 
the level set touches the boundary of the feasible region.  This avoids 
ambiguous "censored" boundary points.)  Also set the scaling of inputs 
to reflect the differing ranges of the features.

```{r}
rect0 <- boundingRect(feasbnds, expand=1.1, spec=fobj)
scl0 <- inpScale(feasbnds[2, ] - feasbnds[1, ], spec=fobj)
```

## Define the level set of interest

We want to identify regions of input space where the reported probability 
of being an Adelie penguin is high.  (Note that their proportion in the 
training data is 0.44.)

```{r}
thresh <- 0.75
```

# Profiling the response function along some rays

First we examine profiles of the reported Adelie probability along 
rays parallel to each coordinate axis.  The ray origin (position 0) is the 
reference point of 'rect0', namely its center.  Position 1 along each ray 
is its intersection with the boundary of 'rect0'.

```{r}
rays0 <- axisRays(fobj, rect=rect0, scale=scl0)
profs <- respProfiles(fobj, rays=rays0, posns=c(-10:10)/10, 
                      lsetthresh=thresh)
summary(profs)
```

```{r, eval=requireNamespace("ggplot2", quietly=TRUE)}
plot(profs, groupBy=inpNames(fobj))  # 'groupBy' puts each ray in a separate panel
```

```{r, eval=!requireNamespace("ggplot2", quietly=TRUE), echo=FALSE}
message("'ggplot2' package for plotting 'respProfiles' is not available")
```

Short `bill_len` is strongly associated with a high probability of species 
being Adelie.

# Identifying the high-probability Adelie region of feature space

We use the `bdrySearch()` function, and select the initial 
ray origin from among the profile points in `profs` that belong to the 
level set.

```{r}
profinfo <- respInfo(profs)  # data frame, including column 'lset'
print(subset(profinfo, lset))
```

All nine level set points are from the `bill_len` profile (`line = 1`). 
We choose point 8 in `profs` because it is reasonably close to the center of 
`rect` (`posn` not too far from the ray origin at position 0) but not 
actually on the boundary of the level set. 

```{r}
srch1 <- bdrySearch(fobj, lsetthresh=thresh, rect=rect0, scale=scl0, 
                    initOrigin=ptCoord(profs)[8, ])
segs1 <- lsetSegs(srch1)
```

To gain some confidence that the line segments in `segs1` do in fact lie 
entirely in 
the level set, and that the level set does not have multiple parts, 
evaluate the response function at a number of points within 
each segment:

```{r}
chk <- lsetSegsCheck(segs1, fnobj=fobj, posns=(1:4)/5)
table(chk$validIn)  # all TRUE
```

Summarize and plot the segments:

```{r}
summary(segs1)
pairs(segs1, rect=rect0, main=paste0("Segments in feature space with ", 
                                     "Pr(Adelie) >= ", thresh))
```

A level set segment was found on each of the 125 rays.  Some things to note: 

- Most segment endpoints (179 of 250) are on the boundary of the feasible 
region.

- The level set is approximately rectangular when projected onto any pair 
of `bill_len`, `bill_dep`, and `flipper_len`. 

- There are no obvious patterns in the level set with respect to `weight`.

- This package requires all inputs to the response function to be numeric, 
so the feature `sex` was coded as 1 (female) or 2 (male) when developing 
the classifier `rf1`.  There are segments that end at values between 
1 and 2, which doesn't make sense for classifying penguins.  A way to 
address this is to slice the level set analysis, as shown in the next section.

# Slicing the level set

Since `sex` is a categorical feature it is logical to examine the level 
set separately for females and males.  We will also slice 
`weight` at three values (low/medium/high = 3000/4500/6000 grams).  And the 
boundary search will build on the results already found so far 
(`currentBdry=srch1`).  Control option `axisDegree` is set to 2, rather 
than the default of 1, in order to increase the number of rays per 
origin at each search step.  (See `?srchControl`.)

```{r}
slices2 <- sliceMat(list(sex=c(1, 2), weight=c(3000, 4500, 6000)), 
                    spec=fobj)  # 2 * 3 = 6 slices
srch2 <- slicedBdrySearch(fobj, lsetthresh=thresh, rect=rect0, scale=scl0, 
                          slices=slices2, currentBdry=srch1, 
                          control=list(axisDegree=2))
```

The result `srch2` is a list with one component per slice.  Each component 
is itself a list like the one returned by `bdrySearch()`.  The function 
`plotSegsList()` can plot the results, putting each slice into a separate 
panel (package `ggplot2` is required).

```{r}
if (requireNamespace("ggplot2", quietly=TRUE)) {
  plt <- plotSegsList(srch2, dims=c("bill_len", "bill_dep"), rect=feasbnds, 
                      wrap=list(ncol=2))
  print(plt + ggplot2::labs(title=paste0("Pr(Adelie) >= ", thresh, 
                                         " (bill_dep vs bill_len)")))
  plt <- plotSegsList(srch2, dims=c("bill_len", "flipper_len"), rect=feasbnds, 
                      wrap=list(ncol=2))
  print(plt + ggplot2::labs(title=paste0("Pr(Adelie) >= ", thresh, 
                                         " (flipper_len vs bill_len)")))
  plt <- plotSegsList(srch2, dims=c("bill_dep", "flipper_len"), rect=feasbnds, 
                      wrap=list(ncol=2))
  print(plt + ggplot2::labs(title=paste0("Pr(Adelie) >= ", thresh, 
                                         " (flipper_len vs bill_dep)")))
}
```

The level set has corners, which is not surprising for a tree-based 
classifier.  Sex 
has only small effect on the level set, as expected from the variable 
importance measures shown earlier.  Weight has a noticable effect, with the 
level set extending to small values of `bill_dep` for animals with the 
smallest weights.

# References {-}

Breiman, L (2001). Random forests. _Machine Learning_ *45*(1), 5-32.

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer
Archipelago (Antarctica) penguin data. R package version 0.1.0.
https://allisonhorst.github.io/palmerpenguins/. 
doi:10.5281/zenodo.3960218.

Liaw A, Wiener M (2002). Classification and regression by
randomForest. _R News_, *2*(3), 18-22.
<https://CRAN.R-project.org/doc/Rnews/>.