Introduction
The code below was a lab I completed as part of my Master of Environmental Data Science degree at the University of California, Santa Barbara. I have cleaned up the lab a bit to showcase it here, as I believe it is a good example of some of the machine learning skills I practiced while participating in this program. This was part of Mateo Robbins's EDS 232 course: Machine Learning for Environmental Data Science.
In this lab, I explored the application of Support Vector Machines (SVMs) and Random Forests (RFs) for multi-class classification using cartographic variables. Specifically, I predicted forest cover type based on a variety of environmental features such as elevation, soil type, and land aspect.
From the official class lab: “Understanding forest cover classification is crucial for natural resource management. Land managers and conservationists rely on accurate predictions of vegetation types to make informed decisions about wildlife habitats, fire management, and sustainable forestry practices. However, direct field assessments of forest cover can be costly and time-consuming, making predictive models a valuable tool for estimating cover types in large or inaccessible regions.”
Dataset Information
The data used for this lab comes from the UC Irvine Machine Learning Repository. This data is not from remote sensing, but rather from cartographic variables. To quote the data webpage: “The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
“This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
Some background information for these four wilderness areas: Neota (area 2) probably has the highest mean elevational value of the 4 wilderness areas. Rawah (area 1) and Comanche Peak (area 3) would have a lower mean elevational value, while Cache la Poudre (area 4) would have the lowest mean elevational value.
As for primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have Ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4).
The Rawah and Comanche Peak areas would tend to be more typical of the overall dataset than either the Neota or Cache la Poudre, due to their assortment of tree species and range of predictive variable values (elevation, etc.) Cache la Poudre would probably be more unique than the others, due to its relatively low elevation range and species composition.”
Several machine learning papers have been published using this dataset. This lab exercise is simply a way to play with this important data, and does not attempt to glean any novel information from it. I will walk you through a step-by-step process of the SVM and RF techniques I practiced on this data.
Step 0: Load Libraries and Data
import pandas as pd
import time
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv("/courses/EDS232/Data/covtype_sample.csv")
Step 1: Data Preprocessing
Before building our classification models, I need to prepare the dataset by separating the features from the target variable (Cover_Type) and splitting the data into training and test sets.
I know that SVMs are sensitive to feature scale, since they rely on distances between observations. Here, I use describe() to summarize the dataset and compare the ranges of the features.
df.describe()

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type_32 | Soil_Type_33 | Soil_Type_34 | Soil_Type_35 | Soil_Type_36 | Soil_Type_37 | Soil_Type_38 | Soil_Type_39 | Soil_Type_40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.00000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 2955.599500 | 154.450000 | 14.114700 | 268.097600 | 45.755300 | 2319.360300 | 212.19660 | 223.113500 | 142.243800 | 1960.040200 | ... | 0.091200 | 0.082900 | 0.00250 | 0.00250 | 0.000200 | 0.000500 | 0.026800 | 0.024500 | 0.013600 | 2.036600 |
| std | 281.786673 | 111.851861 | 7.499705 | 211.899673 | 58.034207 | 1548.558651 | 26.98846 | 19.871067 | 37.799752 | 1320.535941 | ... | 0.287908 | 0.275745 | 0.04994 | 0.04994 | 0.014141 | 0.022356 | 0.161507 | 0.154603 | 0.115829 | 1.383782 |
| min | 1860.000000 | 0.000000 | 0.000000 | 0.000000 | -164.000000 | 0.000000 | 68.00000 | 71.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2804.750000 | 58.000000 | 9.000000 | 95.000000 | 7.000000 | 1091.750000 | 198.00000 | 213.000000 | 119.000000 | 1006.000000 | ... | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 50% | 2995.000000 | 126.000000 | 13.000000 | 218.000000 | 29.000000 | 1977.000000 | 218.00000 | 226.000000 | 142.000000 | 1699.000000 | ... | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 3159.000000 | 258.000000 | 18.000000 | 384.000000 | 68.000000 | 3279.000000 | 231.00000 | 237.000000 | 168.000000 | 2524.000000 | ... | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| max | 3846.000000 | 359.000000 | 65.000000 | 1243.000000 | 427.000000 | 7078.000000 | 254.00000 | 254.000000 | 246.000000 | 7111.000000 | ... | 1.000000 | 1.000000 | 1.00000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 7.000000 |
8 rows × 55 columns
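One thing describe() does not show directly is how balanced the seven cover types are, which matters later when I use stratified cross-validation. A quick check (my addition, not part of the original lab):
# Illustration only: count observations per cover type
print(df["Cover_Type"].value_counts().sort_index())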
Based on what I see above, some of these features have very different scales. For example, the soil type columns are binary (0 or 1), while the distance columns span values in the thousands. So, I'm going to scale the features!
# Separate features and target variable
X = df.drop(columns=["Cover_Type"])
y = df["Cover_Type"]
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)
# Transform the test data
X_test_scaled = scaler.transform(X_test)
Step 2: Hyperparameter Tuning for SVM
To optimize this SVM model, I need to search for the hyperparameters that maximize classification accuracy. Since SVM performance depends heavily on C, kernel, and gamma, I will use GridSearchCV() to systematically test different combinations. I then initialize a cross-validation object with 5 folds using StratifiedKFold, which preserves the class proportions of Cover_Type within each fold. A resource from class: the scikit-learn documentation explains how StratifiedKFold differs from KFold.
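To make that difference concrete, here is a minimal sketch (my own illustration on toy data, not part of the original lab) of how the two splitters treat imbalanced labels:
# Illustration only: compare KFold and StratifiedKFold on imbalanced toy labels
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

y_demo = np.array([0] * 90 + [1] * 10)  # 90% class 0, 10% class 1
X_demo = np.zeros((100, 1))             # placeholder features

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    _, test_idx = next(splitter.split(X_demo, y_demo))
    # StratifiedKFold keeps the 10% minority share in every fold
    print(name, "minority share in first test fold:", y_demo[test_idx].mean())
With unshuffled KFold the first test fold contains only class 0, while StratifiedKFold keeps the 10% minority share, which is why it is the safer choice for imbalanced classification problems like this one.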
Then, I set up a grid to test different values of:
- C (regularization strength): how strictly the model fits the training data. Candidate values: 0.1, 1, 10, 100
- kernel (decision boundary shape): compares linear and radial basis function (rbf) boundaries. Candidate values: linear, rbf
- gamma (influence of training observations): how far the influence of individual points reaches on the decision boundary. Candidate values: scale, auto

(In scikit-learn, gamma='scale' corresponds to 1 / (n_features * X.var()) and gamma='auto' to 1 / n_features; gamma only affects the rbf kernel here, since the linear kernel ignores it.)
As models and datasets become more complex, consideration of computation time becomes more important. Here, I use time.time() to measure the time required to fit the grid object.
# Initialize a cross-validation object with 5 folds
cv = StratifiedKFold(n_splits=5)
# Set up the parameter grid, needed for the GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100],
'kernel': ['linear', 'rbf'],
'gamma': ['scale', 'auto']
}
# Support Vector Machines! SVM, though the class is spelled SVC() for Support Vector Classification
svm = SVC()
# Do the grid search!
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, cv=cv, n_jobs=-1, verbose=0)
# Measure the time it takes to fit the grid object
start_time = time.time()
grid_search.fit(X_train_scaled, y_train)
end_time = time.time()
# Print the best parameters
print("Best Parameters:", grid_search.best_params_)
# Print the time required to fit the grid object
print("Time taken to fit the grid object: {:.2f} seconds".format(end_time - start_time))Best Parameters: {'C': 100, 'gamma': 'auto', 'kernel': 'rbf'}
Time taken to fit the grid object: 498.91 seconds
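Beyond the single best combination, GridSearchCV also records the mean cross-validated score for every combination it tried in its cv_results_ attribute. A short sketch (my addition, not part of the original lab) for inspecting the top performers:
# Illustration only: rank all tested hyperparameter combinations by CV score
cv_results = pd.DataFrame(grid_search.cv_results_)
cols = ["param_C", "param_kernel", "param_gamma", "mean_test_score"]
print(cv_results[cols].sort_values("mean_test_score", ascending=False).head())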
Step 3: Build and fit a Random Forest for comparison
Let’s compare our SVM to a Random Forest classifier. Here, I create a grid for cross-validation with three hyperparameters to tune, along with three sensible values for each one.
# OK, let's do a random forest to make comparisons
param_grid_rf = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30],
'min_samples_split': [2, 5, 10]
}
# Call in the RF model
rf = RandomForestClassifier(random_state=42)
# Do the grid search, but on the rf
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=cv, n_jobs=-1, verbose=0)
# Do the same fancy time trick
start_time_rf = time.time()
grid_search_rf.fit(X_train_scaled, y_train)
end_time_rf = time.time()
# Print the best parameters
print("Best Parameters for Random Forest:", grid_search_rf.best_params_)
# Print the time required to fit the grid object
print("Time taken to fit the Random Forest grid object: {:.2f} seconds".format(end_time_rf - start_time_rf))Best Parameters for Random Forest: {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 200}
Time taken to fit the Random Forest grid object: 14.68 seconds
Step 4: Model Predictions and Evaluation
Now that I have trained and optimized both an SVM and an RF model, I evaluate their performance on the test set to prepare for model comparison. The steps I take are:
- Use the best models from GridSearchCV() to make predictions on the test set
- Generate a confusion matrix for each model to visualize classification performance
# Best SVM model from GridSearchCV
best_svm = grid_search.best_estimator_
# Best Random Forest model from GridSearchCV
best_rf = grid_search_rf.best_estimator_
# Make predictions on the test set
y_pred_svm = best_svm.predict(X_test_scaled)
y_pred_rf = best_rf.predict(X_test_scaled)
# Confusion matrix for SVM
cm_svm = confusion_matrix(y_test, y_pred_svm)
disp_svm = ConfusionMatrixDisplay(confusion_matrix=cm_svm)
disp_svm.plot()
plt.title("Confusion Matrix - SVM")
plt.show()
# Confusion matrix for Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf)
disp_rf.plot()
plt.title("Confusion Matrix - Random Forest")
plt.show()
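As a side note (my addition, not part of the original lab), raw counts in these plots can be hard to compare across classes of very different sizes; normalizing each row turns the matrix into per-class recall. A sketch using the seaborn import from Step 0:
# Illustration only: row-normalized confusion matrix shows per-class recall
cm_rf_norm = confusion_matrix(y_test, y_pred_rf, normalize="true")
sns.heatmap(cm_rf_norm, annot=True, fmt=".2f", cmap="Blues")
# (axis ticks are 0-based indices; cover types are labeled 1-7)
plt.xlabel("Predicted cover type")
plt.ylabel("True cover type")
plt.title("Row-Normalized Confusion Matrix - Random Forest")
plt.show()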

Step 5: Gather and display additional performance metrics
Now, I want to display the accuracy score and training time required for each model so I can compare the models.
# Calculate accuracy scores
accuracy_svm = accuracy_score(y_test, y_pred_svm)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
# Print accuracy scores and training times
print("Accuracy for SVM: {:.2f}".format(accuracy_svm * 100))
print("Time taken to fit the SVM grid object: {:.2f} seconds".format(end_time - start_time))
print("Accuracy for Random Forest: {:.2f}".format(accuracy_rf * 100))
print("Time taken to fit the Random Forest grid object: {:.2f} seconds".format(end_time_rf - start_time_rf))Accuracy for SVM: 76.57
Time taken to fit the SVM grid object: 498.91 seconds
Accuracy for Random Forest: 80.50
Time taken to fit the Random Forest grid object: 14.68 seconds
Step 6: Compare the models
Now that we have trained, optimized, and evaluated both SVM and RF models, we will compare them based on overall accuracy, training time, and types of errors made.
The accuracy for the SVM is a bit lower, at 76.57%, whereas the accuracy of the RF was 80.50%. The SVM also took significantly longer to run, at 498.91 seconds, as opposed to 14.68 seconds for the RF. So in general, I think that in this case, the random forest model is the better choice: here it was both faster and more accurate. In a scenario where computation time was not an issue, an argument could be made for tuning the SVM further, but if simplicity is the name of the game, then Random Forest is the way to go.
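To back up the "types of errors made" comparison with numbers, here is a short sketch (my addition, not part of the original lab) that prints per-class precision and recall for both models:
# Illustration only: per-class precision/recall for both fitted models
from sklearn.metrics import classification_report
print("SVM:")
print(classification_report(y_test, y_pred_svm, zero_division=0))
print("Random Forest:")
print(classification_report(y_test, y_pred_rf, zero_division=0))
This can reveal whether the rarer cover types (such as cottonwood/willow, type 4) are driving the accuracy gap between the two models.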