An open-source image classifier for characterizing recreational activities across landscapes

1. Environmental management increasingly relies on information about ecosystem services for decision-making. Compared with regulating and provisioning services, cultural ecosystem services (CES) are particularly challenging to characterize and measure at management-relevant spatial scales, which has hindered their consideration in practice. 2. Social media are one source of spatially explicit data on where environments support various types of CES, including physical activity. As tools for automating social media content analysis with artificial intelligence (AI) become more commonplace, studies are promoting the potential for AI and social media to provide new insights into CES. Few studies, however, have evaluated what biases are inherent to this approach and whether it is truly reproducible. 3. This study introduces and applies a


| INTRODUCTION
Ecosystem services are a widely accepted framework for incorporating and accounting for human well-being in environmental assessments and management decisions (IPBES, 2019; Mandle et al., 2021; TEEB, 2010). Beyond the material services provided by ecosystems, the framework emphasizes consideration of the nonmaterial and cultural ecosystem services (CES) that benefit people through increased physical, mental, social and spiritual health (Fish et al., 2016; IPBES, 2019; Russell et al., 2013). However, CES are rarely considered in practice, especially compared with regulating and provisioning services such as water supply and carbon sequestration (Alix-Garcia & Wolff, 2014; Goldman-Benner et al., 2012; Gould et al., 2019). The potential of CES research has not been fully realized by managers, at least in part due to difficulties characterizing and measuring CES, as well as mismatches in the scale of CES studies and management decisions (Baumeister et al., 2020; Gould et al., 2019; Guerry et al., 2015; Hernández-Morcillo et al., 2013; La Rosa et al., 2016; Satz et al., 2013).
Recent research has proposed that challenges to conceptualizing and measuring CES can be overcome by analysing information shared on social media (Calcagni et al., 2019; Gliozzo et al., 2016; Wood et al., 2013). This idea is premised on the observation that people use social media platforms such as Twitter and Instagram to share descriptions and depictions of their cultural experiences and interactions with the surrounding environment (Di Minin et al., 2015; Ghermandi & Sinclair, 2019; Langemeyer & Calcagni, 2022).
Additionally, since social media are often georeferenced (meaning they are tied to the location where the content is created), they provide a spatial record of human-ecosystem interactions. As Havinga et al. (2020) demonstrate, geotagged social media are useful for measuring many types of CES including physical activity, aesthetic appreciation, ecological meaning, the development of knowledge and spiritual importance. Indeed, studies leveraging publicly available social media have begun investigating where environments provide aesthetically pleasing landscapes (Egarter Vigl et al., 2021; Figueroa-Alfaro & Tang, 2017; Ghermandi et al., 2020), enjoyment of plants and animals (Richards & Friess, 2015), and opportunities to participate in religious (Roberts, 2017), spiritual (Oteros-Rozas et al., 2018) or recreational activities (Väisänen et al., 2021).
Most prior CES research using social media has relied on total counts of users who share content to map recreational visitation (Tenkanen et al., 2017; Wilkins et al., 2021; Wood et al., 2020) and understand how visitation varies across environments or conditions (Fisher et al., 2018; Kim et al., 2019; Levin et al., 2017). More recently, studies have begun utilizing not only the location and volume of social media, but also the content of the posts (e.g. Clemente et al., 2019; Egarter Vigl et al., 2021; Gosal et al., 2019). In some studies, the content contained in images is classified by manually viewing and labelling images (Clemente et al., 2019; Langemeyer et al., 2018; Oteros-Rozas et al., 2018; Pickering et al., 2020; Richards & Friess, 2015; Thiagarajah et al., 2015). This method is inefficient and difficult to scale, however, with an average of 140 photos labelled per researcher per hour (Richards & Friess, 2015), so recent studies have explored automated methods using artificial intelligence (AI) provided by commercial tools such as Google Cloud Vision (Gosal et al., 2019; Richards & Tunçer, 2018) and Clarifai (Egarter Vigl et al., 2021; Lee et al., 2019). Gosal et al. (2019), for example, used Google Cloud Vision combined with a latent semantic analysis in order to map several recreational values and identify where recreation may threaten particular species within protected areas. Lee et al. (2019) used Clarifai to annotate images, then performed a network analysis to derive themes of the photos, several of which were related to cultural services. These studies were an innovative step towards using publicly shared photographs to learn about recreational, aesthetic and other cultural benefits which people derive from nature, and they illustrate the potential for AI to be applied to a wide variety of questions related to these topics.
Despite the demonstrated potential for AI such as computer vision to be used to measure aspects of CES from social media, few studies have evaluated what biases are inherent to this approach and whether it is truly reproducible. There are several potential issues.
First, the AI underlying commonly used tools may not be generating consistent predictions about the content depicted in images. Yet, this is difficult to evaluate since nearly every study to date has relied on proprietary tools that are implemented in commercial software (but see Väisänen et al., 2021). These commercial tools are generally not well documented and are often modified (or decommissioned) without notice to or consultation with users. This lack of transparency makes it difficult to reproduce analyses and to apply standard methods for evaluating model accuracy and performance since the outputs of a model may differ in unknown ways over time (Lazer et al., 2014). In contrast, with open-source tools and algorithms, it is feasible to test and evaluate model performance in different situations and understand how outcomes are related to model structure and parameters. Second, even when computer vision models perform well, the content of the images that are shared as social media may not accurately represent the ways that people interact with the environment. There are likely biases in both who chooses to share images to particular platforms and in what type of content is represented in those images (Mashhadi et al (Muñoz et al., 2020; Tolvanen et al., 2020) or surveys (Heikinheimo et al., 2017; Song et al., 2020), it is impossible to say whether applying AI to social media is generating consistent, complementary or even directly contradictory information about CES.
This study explores the potential for social media and computer vision to map where ecosystems create opportunities for recreational activity, as one example of a CES. We introduce a convolutional neural network (CNN) that uses AI to recognize outdoor activities in the content of photographs posted to social media. We describe the creation of a training dataset and our development and evaluation of the model using 13 years of photographs from Flickr.
To test how well the model performs, we compare model predictions to labels assigned manually by researchers, and discuss the performance overall and by activity class, both in the original study region and in a novel region of the same forest. Then we demonstrate that there are biases in the frequency with which various activities are shared on social media, by comparing the predicted activities to activities reported by respondents to an on-site survey administered on the same public lands in Washington, USA. Finally, we present an example application of the methods by creating maps of recreational activities at two locations and relating the diversity of activities present in different parts of the region to underlying landscape characteristics. To facilitate future model development and to ensure reproducibility, we share an entirely open-source software package called recCNNize 1 as well as the fitted weights of our CNN model so that others can replicate our approach for identifying recreational activities in images.

| METHODS
In this study, we develop, test and apply a classifier to recognize recreational activities in images shared on social media from two locations on public land in Washington, USA.

| Study site
This study focuses on public lands in the Mount Baker-Snoqualmie National Forest in western Washington, USA (Figure 1). We de-

| Flickr data
We collected all photographs posted to Flickr within the Middle

| On-site surveys
Researchers surveyed visitors leaving the Middle Fork Valley on 23 days between August 2 and October 10, 2018. We selected survey days throughout the survey period to include an even number of weekday (Monday-Thursday) and weekend (Friday-Sunday) days. After intercepting a group of visitors, researchers explained that they were studying outdoor recreation in the Middle Fork Valley, that participation was entirely voluntary, and that responses would be anonymous. They then verbally asked whether the visitors would participate in the study. Upon receiving verbal consent, the researchers asked a random member of the party (the adult with the next upcoming birthday) to complete a written survey in English. The intercept survey was part of a larger project to describe outdoor recreational use in the region, and among other questions it asked visitors to select the activities they participated in during their visit to the region from a predefined list (Table 1). In total, 595 visitors representing 595 parties completed a survey, and 580 of the respondents reported what recreational activities they participated in during their visit. The University of Washington Institutional Review Board reviewed our study and determined that the research met regulations for the protection of human subjects and the study carried no greater than minimal risk to those subjects. Accordingly, they granted the study exempt status (IRB ID: STUDY00005339).

| Recreational activity classes
We chose 12 of the most common recreational activities in the Middle Fork for image classification (Table 1). Additionally, we created two 'no activity' classes to aid in image classification: one for photographs without any people present, and the other for photographs where people were present but not obviously engaging in any particular activity.

| CNN classifier
We developed a CNN to measure the probability that Flickr images contained evidence of the 14 recreational activity classes (Table 1).
We regarded the identification of activities in images as a multinomial classification problem. In other words, given a photograph X_i from location i, we evaluated whether the photograph reflected recreational activity a. We customized and fine-tuned Google's Inception (InceptionResNetV2) model (Szegedy et al., 2017) to recognize patterns corresponding with our focal recreational activities.
We chose to use this particular model because it is widely recognized as one of the best performing CNN models, compared to competing models such as ResNet and GoogLeNet (Bianco et al., 2018; Canziani et al., 2017). While it is slower to train, we expected the depth and complexity of this model to be more likely to capture subtle differences between activity classes. Furthermore, the training speed was not a critical concern since our study used a relatively small number of training images as compared to other work using CNNs.
The final model contained 164 layers, the first 160 of which were taken directly from the original Inception model. Because these early layers recognize lower-level features, such as edges and basic shapes, we were able to take advantage of them by re-using the trained weights provided with the original CNN model (Szegedy et al., 2017), in a process known as transfer learning.
We customized the model by removing the final dense layer from the Inception model (which is trained to recognize 1000 classes in ImageNet photos), and replacing it with four custom layers which we trained using the training data described in Section 2.3.3. Specifically, we added a 2D global average pooling layer, a new dense layer with 1024 units, a 30% dropout layer (i.e. 30% of the features are set to zero randomly) and a dense layer with softmax activation that assigns class labels to images (n = 14) (Chollet et al., 2018) (Figure S1).
The average pooling and the dropout layers are included to avoid overfitting. Note that the dropout layer was ignored when testing at the end of each training epoch, so all features were used in predicting validation images. A consequence of using a dropout layer is that the model can yield higher validation accuracy than training accuracy. The final softmax layer calculates the probability of an image belonging to each of the possible classes. In the analysis, we regarded this probability output as the model's confidence in its prediction for a target image, and call the class with the highest probability the top-1 class label. We used the top-1 prediction in later analyses.
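The softmax and top-1 steps described above can be sketched in a few lines. This is a minimal illustration rather than the recCNNize implementation: the class list and the raw scores are hypothetical, and the helper names `softmax` and `top1` are introduced here only for the example.

```python
import math

# Hypothetical subset of classes; Table 1 defines the actual 14 classes.
CLASSES = ["hiking", "camping", "swimming", "no activity"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1(probabilities, classes):
    """Return the most probable class and its probability (the 'confidence')."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]

probs = softmax([2.0, 0.5, -1.0, 0.1])  # hypothetical scores for one image
label, confidence = top1(probs, CLASSES)
print(label, round(confidence, 3))
```

Because only the top-1 label is carried forward, two images with very different confidence values receive the same label, which is why the per-class confidence distributions are worth inspecting separately.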

| Classifier training
We randomly sampled 6459 (44%) of the Flickr images downloaded from the Middle Fork and manually assigned each to one of the recreational activity classes described in Table 1. There was an embedded data imbalance that led to a scarcity of some activity classes in the random photographs (e.g. only one image depicting trail running and three images depicting fishing). Following an initial round of training, we added training data for some poorly performing activity classes by downloading public tagged photographs from ImageNet 2 and Pixabay 3 (Table S1; Figure S2). These additional images were used primarily to increase the number of images included in the minor classes, as well as adding photographs for classes which were frequently confused with one another (such as mountain biking and the 'other activities' class which included motorcycling). These supplementary images were mainly in situ images relevant to the activities, but for some classes without sufficient in situ images we also included images depicting elements or objects such as backpacks that we believed to be associated with a specific activity. The full training dataset was 11,912 photographs (6459 Flickr images from the Middle Fork and 5453 images from the ImageNet and Pixabay databases; Table S1).
We used the 11,912 labelled images to train the final four custom layers in our CNN model. In other words, only the four layers were trained against the labelled Flickr photos while the early layers were unchanged or 'frozen' during training (Chollet et al., 2018).
We sought to minimize the categorical cross-entropy loss function, a common function used to estimate how well predicted class probabilities match the target classes, using the Adam optimization algorithm (learning rate = 1e-5) (Chollet, 2015). We allowed the algorithm to conduct up to 300 complete cycles through the training dataset (i.e. epochs) as it searched for the optimal solution and training weights, but instructed it to stop training if the training accuracy did not improve in 20 consecutive epochs.
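To make the loss and the stopping rule concrete, the sketch below computes the categorical cross-entropy for a single image and finds the epoch at which a patience-based rule would halt training. It is an illustration under stated assumptions, not the training code used in the study: the function names and the accuracy history are hypothetical, and the rule is assumed to require a strict improvement over the best accuracy seen so far.

```python
import math

def categorical_cross_entropy(probabilities, true_index, eps=1e-12):
    """Loss for one image: negative log of the probability given to the true class."""
    return -math.log(max(probabilities[true_index], eps))

def stopping_epoch(accuracy_by_epoch, patience=20, max_epochs=300):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    monitored accuracy exceeding its best value so far, or at `max_epochs`.
    """
    best = float("-inf")
    epochs_without_gain = 0
    last_epoch = 0
    for epoch, accuracy in enumerate(accuracy_by_epoch[:max_epochs], start=1):
        last_epoch = epoch
        if accuracy > best:
            best = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return last_epoch

# Hypothetical history: accuracy improves for three epochs, then plateaus.
history = [0.5, 0.6, 0.7] + [0.7] * 30
print(stopping_epoch(history))
print(round(categorical_cross_entropy([0.7, 0.2, 0.1], 0), 4))
```

With this history the best accuracy is reached at epoch 3, so 20 epochs without a gain ends training at epoch 23.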
We trained the model on 60% (7140) of the labelled images and withheld 40% (4772) of the images for validation during the training process. To compensate for the small number of images in some classes, we augmented our training data by applying random transformations in each epoch (such as flipping, resizing, brightening and rotating, Table S2). We chose a batch size of 512, meaning that 512 of the labelled images were used to train at a time, causing the algorithm to run through 14 batches (7140 ∕ 512, rounded up) in each epoch. In each batch, the predictions were iteratively evaluated against the true class labels using categorical accuracy (i.e. training accuracy), and then the model weights were updated. Thus, the model was updated 14 times in each epoch. After each epoch, the algorithm evaluated model performance on the 4772 validation images which were not used for training (i.e. validation accuracy) and the optimizer routine decided whether to continue or stop training.
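The number of weight updates per epoch follows directly from the training-set size and the batch size. The short sketch below reproduces that arithmetic; `batch_bounds` is a hypothetical helper written for this example and makes no claim about how the training library slices the data internally.

```python
import math

def batch_bounds(n_images, batch_size):
    """Return (start, stop) index pairs covering all images once, i.e. one epoch."""
    n_batches = math.ceil(n_images / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_images))
            for i in range(n_batches)]

bounds = batch_bounds(7140, 512)
print(len(bounds))                     # weight updates per epoch
print(bounds[-1][1] - bounds[-1][0])   # size of the final, smaller batch
```

For 7140 training images and a batch size of 512 this gives 13 full batches plus a final batch of 484 images, hence 14 updates per epoch.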

TABLE 1 The 12 activity and two no-activity classes that we trained our image classifier to recognize. The description is our criteria for deciding whether or not to assign a class during manual evaluation, and the survey categories are the closest matching activities from the predefined list on the survey instrument.

S3). The validation accuracy and loss were better than the training accuracy and loss in this study, which is likely due to the dropout layer in the CNN model.

| Classifier evaluation
To evaluate model performance beyond the training data, we carried out two external validations. First, we randomly selected 742 of the Middle Fork Flickr photographs that were not used in the training (n = 8380) and manually evaluated whether they belonged in the class with the greatest probability returned by the CNN model (the top-1 prediction). These test photographs were stratified to represent 20% of the images in each top-1 prediction class, and represented 6.2% of all Flickr photos in the Middle Fork. Additionally, we tested the model in the Mountain Loop region. We predicted the activity class of all Flickr photographs acquired in this region (n = 18,350), using the model trained on the Middle Fork images. We randomly sampled the predicted photographs from this site using the same sampling scheme, and manually evaluated the activity classes (n = 491; 2.7% of the total images).
The manual evaluation consisted of two examiners independently assigning the sampled images to their 'true' class and comparing these results to the model predictions. We evaluated these predictions by creating a confusion matrix and calculating several standard metrics (precision, recall and F1 score) for each class, as well as summarizing overall model performance using accuracy, macro F1, and Cohen's unweighted Kappa (Kuhn et al., 2008).
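The metrics named above can all be derived from the confusion matrix. The sketch below uses the standard textbook definitions; the function names and the toy matrix are hypothetical. Two choices here are assumptions rather than details from the study: F1 is treated as undefined when precision or recall is undefined or when both are zero, and macro F1 is averaged only over classes with a defined F1.

```python
def class_metrics(matrix, k):
    """Precision, recall and F1 for class k; matrix[i][j] counts images
    whose true class is i and whose predicted class is j."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # predicted k, truly another class
    fn = sum(matrix[k]) - tp                  # truly k, predicted another class
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None                             # undefined, reported as missing
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def overall_metrics(matrix):
    """Accuracy, macro F1 (mean over classes with a defined F1) and Cohen's kappa."""
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    observed = sum(matrix[k][k] for k in range(size)) / n
    expected = sum(
        sum(matrix[k]) * sum(row[k] for row in matrix) for k in range(size)
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    f1_scores = [class_metrics(matrix, k)[2] for k in range(size)]
    defined = [f for f in f1_scores if f is not None]
    macro_f1 = sum(defined) / len(defined) if defined else None
    return observed, macro_f1, kappa

toy = [[5, 1],   # hypothetical two-class confusion matrix
       [2, 4]]
accuracy, macro_f1, kappa = overall_metrics(toy)
print(accuracy, kappa)
```

For the toy matrix, 9 of 12 images lie on the diagonal (accuracy 0.75) and chance agreement is 0.5, so kappa is 0.5.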
Each of the class-level metrics that we used were calculated from a combination of the number of true positives (TP, images which be-

| Survey comparison
To examine how well images of activities shared on Flickr, as classified by our model, represent actual rates of activity participation in the region, we compared our model predictions to the empirical survey data. For this comparison, we subset our Flickr photographs to only include those which were geotagged within recreation areas which we believed to be accessed primarily by the Middle Fork Road, excluding some locations which could be more easily accessed from a nearby highway. This resulted in 8396 photos. To control for users who may have posted multiple photographs of a single activity, we calculated activity photo-user-days (APUD) (following Wood et al., 2013) as the number of unique users posting images that were classified (using the top-1 prediction) as being in each activity, each day. APUD is not directly analogous to PUD, since a single user's photographs may represent multiple activities on a single day. In total, we found that the 8396 photos reflected 2321 APUD, 1076 of which represented recreational activities (1245 were classified into one of the no activity classes). We chose to include all photographs ever posted to Flickr in the region because there were too few photographs posted during the months of the survey for a meaningful comparison (only 15 APUD representing recreational activities in August-October 2018). We compared total APUD per activity to the number of survey respondents who reported participating in each activity and measured correspondence between the two datasets by calculating Pearson's correlation.
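The APUD definition above amounts to counting unique user-day pairs within each activity. The following is a minimal sketch of that counting rule; the record layout `(user_id, date, top1_activity)`, the function name and the example records are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter

def activity_photo_user_days(photos):
    """Count unique (user, day) pairs per activity.

    `photos` is an iterable of (user_id, date, top1_activity) tuples. Several
    photos by one user of one activity on one day count once; the same user
    can still contribute to several activities on the same day.
    """
    unique = {(user, day, activity) for user, day, activity in photos}
    return Counter(activity for _, _, activity in unique)

# Hypothetical records for illustration only.
photos = [
    ("u1", "2016-07-04", "hiking"),
    ("u1", "2016-07-04", "hiking"),    # duplicate activity-user-day
    ("u1", "2016-07-04", "swimming"),  # same user and day, second activity
    ("u2", "2016-07-04", "hiking"),
    ("u2", "2016-07-05", "hiking"),
]
apud = activity_photo_user_days(photos)
print(dict(apud))
```

The third record shows why APUD totals can exceed plain photo-user-days: user u1 adds one unit to hiking and one to swimming on the same day.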

| Case study
Finally, to demonstrate one potential use of this model, we used all of our classified images from both regions to create maps of the frequency of recreational activities shared on Flickr. Due to the relatively small number of photographs representing some of the activities, we focused on the diversity of activities found across the landscape. We did not include fishing, trail running or horseback riding in our diversity calculation because these classes were poorly represented in our data. We divided each study area into 2 km grid cells, and for each grid cell we calculated the total APUD for each activity, based on the top-1 prediction from the CNN. We then calculated and mapped the number of distinct activities represented in each grid cell. Because we removed the three data-poor classes above and the two no activity classes, the maximum number of activities possible in a grid cell was nine.
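The gridding step can be sketched as follows, assuming photo locations have already been projected to a coordinate system in metres. The record layout, the function name and the two no-activity label strings are hypothetical; only the 2 km cell size and the three excluded minor classes come from the text.

```python
from collections import defaultdict

CELL_SIZE_M = 2000  # 2 km grid cells, assuming projected coordinates in metres
EXCLUDED = {"fishing", "trail running", "horseback riding",
            "no activity (no people)", "no activity (people)"}  # hypothetical labels

def activities_per_cell(records, cell_size=CELL_SIZE_M, excluded=EXCLUDED):
    """Map each grid cell (col, row) to the number of distinct activities in it.

    `records` is an iterable of (x, y, top1_activity) tuples.
    """
    seen = defaultdict(set)
    for x, y, activity in records:
        if activity in excluded:
            continue
        cell = (int(x // cell_size), int(y // cell_size))
        seen[cell].add(activity)
    return {cell: len(activities) for cell, activities in seen.items()}

# Hypothetical photo locations and top-1 labels.
records = [
    (500, 500, "hiking"), (1500, 900, "camping"), (1999, 10, "hiking"),
    (2500, 100, "hiking"), (2600, 100, "fishing"), (4100, 4100, "fishing"),
]
diversity = activities_per_cell(records)
print(diversity)
```

Cells that contain only excluded classes do not appear in the result at all, so a regression on these counts would need to add them back as zeros if empty cells are to be modelled.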
We related the number of distinct activities in a grid cell to underlying landscape features to learn about the features of the landscape that drive a greater diversity of activities in each region. For this step, we pooled the two regions and used negative binomial regression, where the number of activities was modelled as a function of whether or not the grid cell contained a campsite, lake, picnic area, river or trail, as well as the minimum distance to the nearest major road, the elevation of the centre of the grid cell and the proportion of the grid cell which is designated wilderness. All variables were scaled to fall between 0 and 1 so that coefficients could be compared directly. To avoid issues with multicollinearity, we checked for strong correlations between the variables (all |r| < 0.4). We compiled landscape data from the US Forest Service 4 (campsites, picnic areas, trails and wilderness boundaries), Washington State Department of Transportation 5 (roads), Washington State Geospatial Open Data Portal 6 (rivers and lakes) and USGS Elevation Point Query Service 7 (elevation), accessed using the elevatr package in R (Hollister, 2018). We chose to use a negative binomial GLM because the number of activities occurring in a cell is a discrete, rather than continuous variable, and our data were overdispersed so a Poisson model was not appropriate. We measured the predictive power of our model using a pseudo-R² metric (Zuur et al., 2009).
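Two of the preparatory steps above are easy to illustrate: rescaling a predictor to the 0 to 1 range, and a rough check for overdispersion. The sketch below is not the regression itself and uses hypothetical values throughout. The variance-to-mean ratio is only a crude marginal diagnostic, assumed here for illustration; the study does not state how overdispersion was assessed.

```python
from statistics import mean, variance

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant predictor: no spread to rescale
    return [(v - low) / (high - low) for v in values]

def dispersion_ratio(counts):
    """Sample variance divided by the mean; about 1 under a Poisson model."""
    return variance(counts) / mean(counts)

elevation_m = [300, 450, 900, 1500]          # hypothetical cell-centre elevations
activity_counts = [0, 0, 1, 0, 9, 0, 2, 0]   # hypothetical activities per cell

scaled = min_max_scale(elevation_m)
ratio = dispersion_ratio(activity_counts)
print(scaled, round(ratio, 2))
```

A ratio well above 1, as in this toy example, is the kind of pattern that motivates a negative binomial model over a Poisson model, although a proper assessment would examine dispersion conditional on the predictors.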

| Classifier evaluation
The classifier performed well in the Middle Fork region, achieving overall accuracy of 0.71 (95% CI: 0.67-0.74; macro F1 = 0.61; Cohen's unweighted Kappa = 0.59) on the randomly sampled test  2, Figure 3). The model performed well on images from the bird watching, hiking and no activity classes, with relatively high F1 scores driven by high recall (indicating that the model was able to successfully identify most images which truly represented these classes) and high precision (indicating that most of the photographs classified in these activities were correct). Note that the perfect F1 score for boating is due to an extremely small test sample (support = 1), so further testing is necessary to accurately judge the performance of this minor class. The swimming and backpacking classes both had high recall but low precision, indicating that the model was able to successfully identify photographs that included these activities, but that it also incorrectly assigned many other photographs to these classes.
Because of the low precision of these classes, their F1 scores were correspondingly low. The other activities class had relatively low recall and precision, likely due to the broadness of the category. Some of the minor classes (fishing, horseback riding and trail running) did not have F1 scores; fishing and horseback riding were never classified correctly in our test set (precision = 0, despite evaluating two photographs which our model classified as fishing, and 10 images classified as horseback riding), while trail running was not selected as the top-1 activity for any of the photographs in the Middle Fork (Table 2; Figure 2). Between activity classes, there are also notable differences in the confidence of the top-1 class assignments (Figure S5). Classes such as bird watching, no activity and camping were frequently assigned with probability > 0.9, representing the high confidence of the CNN model. In contrast, backpacking and other activities were assigned a lower probability even when they were selected as the most likely class (top-1).

| Regional test of classifier
The model performed almost as well in the novel Mountain Loop region as in the Middle Fork, with an overall accuracy of 0.60 (95% CI: 0.56-0.65; macro F1 = 0.59; Kappa = 0.53; p < 0.0001; Figure 3, Table S3, Figure S4). Compared to the Middle Fork region, the model performance in the Mountain Loop decreased by 12.7% (overall accuracy), 3.3% (macro F1 score) and 10.2% (Kappa). While class performance also varied widely in the Mountain Loop, the classes which performed well were not always the ones which performed well in the Middle Fork (Figure 3). In particular, camping performed much better in the Mountain Loop (F1 score = 0.92) than in the Middle Fork (F1 score = 0.30). This may be due to a substantially larger number of photos of this activity in the Mountain Loop region as compared to the Middle Fork. The swimming, backpacking and other activities classes each performed slightly better in the Mountain Loop, while all other classes performed somewhat worse in the novel region.
Note that fishing had only one test photo in the Mountain Loop, so, as with boating in the Middle Fork, its performance metrics are not indicative of true performance.

| Survey and model prediction comparison
We did not find a strong correlation between the log number of APUD and the log number of survey respondents who reported participating in each activity across the Middle Fork (r = 0.32; 95% CI: −0.31 to 0.75, Figure 4). There were far more survey respondents who reported trail running or fishing, in particular, than there were photos classified as belonging to these activities. While we do not have F1 scores for these classes (due to a lack of successful classifications in our test set), we do know that each of these classes was extremely uncommon in our labelled photos from the Middle Fork (one and three photos, respectively, out of 6459 labelled images; Table S1). When we removed these two classes, as well as horseback riding, which also had no successful classifications in the manual evaluation, we found a much higher correlation between log APUD and log survey respondents (r = 0.73; 95% CI: 0.12-0.94; dashed line in Figure 4). Still, the relationship is not perfectly linear, with some activities (such as swimming and bird watching) apparently under-represented in the photos and others (rock climbing, boating) perhaps over-represented in the photos.
These patterns are not explained by model performance, with both high-performing and low-performing activities falling on both sides of the line (Figure 4).
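The log-log correlation reported in this section can be computed with a few lines of arithmetic. The sketch below is illustrative only: the counts are hypothetical rather than the study's data, the function names are introduced for the example, and it assumes every count is strictly positive because the text does not state how zero counts were handled before taking logs.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_log_correlation(apud_counts, survey_counts):
    """Correlate log counts; assumes every count is strictly positive."""
    return pearson_r([math.log(a) for a in apud_counts],
                     [math.log(s) for s in survey_counts])

# Hypothetical counts per activity, not the study's data.
apud_counts = [10, 100, 1000]
survey_counts = [20, 200, 2000]
r = log_log_correlation(apud_counts, survey_counts)
print(round(r, 3))
```

Working on the log scale means that activities differing by a constant multiplicative factor between the two datasets, as in this toy example, still correlate perfectly.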

| Spatial variability in activities
We found hotspots of recreational activities in both of our study regions (Figure 5). In these areas, we found evidence of visitors participating in up to nine distinct activities. These nine activities include all of the recreational activity classes that our CNN model was able to identify with confidence (not including the minor classes trail running, fishing or horseback riding, nor the two no activity classes).
Other portions of the landscape appear to support fewer recreational activities. The number of distinct activities occurring in a grid cell was highly correlated with total APUD (r = 0.70), indicating that a greater number of activities occur in more popular areas.
The landscape features which we tested in the negative binomial regression model explained some of the variability in activity diversity across the landscape (pseudo-R² = 0.22). In particular, a greater number of activities occurred in areas near roads, at higher elevations, and in areas with trails, rivers, lakes or campgrounds. Of these landscape features, proximity to roads had the greatest effect size, with the number of activities present in a grid cell dropping off quickly as distance to the nearest road increased. The presence of either rivers or lakes was significantly positively correlated with the number of activities, though the effect size was smaller than the effect of the infrastructure variables (roads, trails and campgrounds).
We found no significant relationship with picnic areas or with the proportion of wilderness in a cell (Figure 6).

| DISCUSSION
This study introduces and applies a reproducible approach for study- with careful consideration of underlying biases, there is potential for this approach to identify how recreational activities covary with features of the landscape, thereby explicitly measuring how the CES is supported by several features of the natural and built environments.
More broadly, this study serves as an example of how to build, distribute and apply AI to understand novel aspects of interactions between people and ecosystems.

| Classifier performance
While the accuracy and confidence of our classifier are high overall, they vary substantially across the activity classes in our study, and this finding has broader implications for CES research using computer vision. Our CNN is best at recognizing popular or clearly defined classes such as boating and bird watching that are associated with recognizable objects (Table 2, Figure 3). The model struggles, meanwhile, to distinguish some activity classes that are visually similar to each other, such as hiking and backpacking which involve similar equipment and both occur on trails (Figure S6). There are also important regional differences in model accuracy for some classes such as camping. Despite being trained with images from the Middle Fork region, images of camping are more identifiable in the Mountain Loop. These regional and inter-class differences in performance and accuracy, which are inherent to all image classifiers (including commercial and proprietary ones), highlight the importance of measuring, reporting and addressing the potential for uncertainty in AI to generate biased results.

Our comparison between activity participation according to Flickr photographs and the on-site survey supports previous research concluding that visitors share social media about their participation in a multitude of recreational activities (Hartmann, 2019; Heikinheimo et al., 2017; Norman et al., 2019; Väisänen et al., 2021).
Looking more closely at the relative frequency of each activity, we find that some activities are more popular among survey respondents than among social media users, and vice versa. Trail running, fishing and swimming are less prevalent in social media images compared to the survey responses, whereas rock climbing is more popular on social media. We suspect that this is largely because visitors are less likely to share photographs of certain activities on social media relative to other activities (as suggested by Ghermandi & Sinclair, 2019; Tenerelli et al., 2016; Wood et al., 2013). The measured differences may also be due to the timing of the survey. Swimming is more common in the summer when the survey occurred, and one survey day coincided with a trail running event, meaning that these activities are likely over-represented in the survey data. However, trail running events are not uncommon, and the fact remains that of the 6459 images we manually labelled, only one was of trail running. Additionally, the disconnect may be due partly to differences in time-scales of the two datasets. Due to the relatively small number of photographs posted to Flickr during the survey period (15 APUD representing recreational activities were posted between August and October 2018), we chose to include all photographs posted over the life span of the platform (2005-2018).
The general correspondence in the CNN model predictions between the two regions indicates that there is potential for the model to be applied with relative confidence to characterize recreational activities across larger landscapes. However, the disconnects between the survey results and the images uploaded to social media also suggest that there are sampling biases that should be taken into account before the current model is generalized widely. If some activities are better represented in social media photographs than other activities, there is potential for activities by some user groups to be underestimated or missed entirely if this method is applied without caution.

| Case study
After accounting for potential issues with the underlying social media data and image classifier, we were able to create maps of recreational activity diversity in two distinct regions of the Mount Baker-Snoqualmie National Forest, for the nine activities that our CNN model can identify with confidence. We observe 'hotspots' that support a larger diversity of recreational activities than their surroundings (Figure 5). These hotspots also have more visitors over time, as evidenced by a correlation between the total number of activities and APUD. This may be because more popular areas encourage a greater diversity of activities, or because areas which allow a greater diversity of activities draw more people.
Locations that support a large diversity of activities may also support a greater number of other CES, although this work does not directly measure these other types of interactions with the ecosystem.
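The diversity maps described above can be approximated with a simple binning step. The sketch below counts the number of distinct predicted activities in each 2 km grid cell. It assumes that photograph locations are already in a projected coordinate system measured in metres; the function name, the record layout and the `excluded` label are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

CELL_SIZE_M = 2000  # 2 km grid cells


def activity_diversity(records, cell_size=CELL_SIZE_M, excluded=("no activity",)):
    """Count distinct predicted activities per grid cell.

    records: iterable of (x_m, y_m, activity), with projected
    coordinates in metres (an assumption of this sketch).
    """
    cells = defaultdict(set)
    for x, y, activity in records:
        if activity in excluded:
            continue  # photographs without a recreational activity are skipped
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell].add(activity)
    return {cell: len(activities) for cell, activities in cells.items()}


# Hypothetical classified photographs, for illustration only.
records = [
    (100, 100, "hiking"),
    (1500, 300, "camping"),
    (1900, 1999, "hiking"),
    (2100, 100, "boating"),
    (2500, 500, "no activity"),
]
diversity = activity_diversity(records)
print(sorted(diversity.items()))  # [((0, 0), 2), ((1, 0), 1)]
```

Because each cell stores a set of activities, repeated photographs of the same activity do not inflate the diversity count.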

Our analysis of activity patterns in Mount Baker-Snoqualmie National Forest serves to demonstrate one of many potential applications of fine-grained maps of recreational activities. Specifically, we show how information about the spatial variability in recreation can be leveraged to understand how the CES is related to the ecosystem and elements of the built environment, similar to previous studies which have explored a variety of methods for using social media and PPGIS for this purpose (Bagstad et al., 2016; Fisher et al., 2019; Kliskey, 2000; Levin et al., 2017; Sherrouse et al., 2011).
In the Mount Baker-Snoqualmie National Forest, hotspots of recreational activity occur most often in areas with natural features (rivers, lakes, higher elevations) supported by built infrastructure such as campgrounds and access via trails and roads. Of the landscape features which we considered, proximity to a road had the greatest impact on the diversity of recreational activities. While we chose to consider the diversity of recreational activities, due largely to the relatively low number of images depicting many of our more minor classes, future work could apply a similar model and framework to model individual activities. Additionally, the 'no activity' class could be further divided into various other CES types such as aesthetic and heritage value and analysed similarly (Havinga et al., 2020).
Specific knowledge of activity locations and how they depend on the underlying ecosystems can support sustainable recreation planning. Activity maps can help managers identify areas with high potential for environmental degradation and conflict between user groups that need mitigation or supporting infrastructure.
Information on the relative importance of natural and built environments for recreational use allows managers to target spending on ecological restoration projects and the maintenance or construction of infrastructure improvements. In the Mount Baker-Snoqualmie National Forest, our analyses indicate that by improving access to areas which are currently undeveloped, particularly areas with natural features such as lakes or rivers, managers in the region could increase the likelihood of attracting visitors who will interact with sites in a wider variety of ways. Furthermore, knowing which types of environmental features support a particular recreation activity could help practitioners argue for the value of those landscapes. This explicit valuation of public lands as 'spatial assets' (Jepson et al., 2017) gives managers and conservationists another tool with which to justify continued protection of these areas.

| Limitations and precautions
This study highlights several important precautions when using social media for CES research. As our results show, not all recreational activities are represented proportionally in images posted to Flickr. Beyond the conclusion that certain types of activities are better suited to photographs than others, Flickr users are unlikely to be a representative sample of visitors to the study region (Ruths & Pfeffer, 2014). In the Middle Fork in particular, more than one-third of the photographs posted to Flickr were uploaded by a single user. While we mitigated this impact by calculating APUD rather than working directly with total numbers of photographs, this user's choice about what content to upload clearly impacted our results.
This is an extreme case, but it is important to remember that images from any location may be biased towards particular types of activities that are popular with the dominant Flickr user-group.Future studies would be wise to temper this by working with images from multiple social media platforms (Wood et al., 2020).
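To illustrate why counting user-days rather than photographs dampens the influence of a prolific user, here is a minimal sketch. It assumes that APUD counts each unique combination of user, calendar day and activity once, following the photo-user-day convention; the function name and the record layout are hypothetical and are not taken from the study's code.

```python
from collections import Counter


def activity_photo_user_days(photos):
    """Count unique (user, day, activity) combinations per activity.

    photos: iterable of (user_id, date_string, activity).
    Assumption: one user contributes at most one unit per activity per day,
    regardless of how many photographs they upload.
    """
    unique = {(user, day, activity) for user, day, activity in photos}
    return dict(Counter(activity for _, _, activity in unique))


# Hypothetical records: user u1 uploads three hiking photographs on one day.
photos = [
    ("u1", "2018-08-01", "hiking"),
    ("u1", "2018-08-01", "hiking"),
    ("u1", "2018-08-01", "hiking"),
    ("u1", "2018-08-02", "hiking"),
    ("u2", "2018-08-01", "hiking"),
    ("u2", "2018-08-01", "camping"),
]
apud = activity_photo_user_days(photos)
print(sorted(apud.items()))  # [('camping', 1), ('hiking', 3)]
```

The five hiking photographs collapse to three user-days, so a burst of uploads from one visit counts once. A user who posts on many different days still contributes many units, which is consistent with the caution above that a single dominant user can shape the results.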
Studies that store and analyse images containing people should follow practices that protect the privacy of individuals who appear in the images (Di Minin et al., 2021). Automated analyses may provide a slight advantage in this regard, as fewer of the photographs are actually observed by human researchers (Väisänen et al., 2021).
However, creating training data and validating results still requires researchers to directly observe some photographs containing individuals who have not explicitly opted in to this research, or necessarily consented to having their likeness shared publicly on social media in the first place. For this study, we stored Flickr images on a restricted disk and then limited the number of researchers who directly viewed the images. Then, we chose to make our training weights, but not the training images themselves, public. This approach poses little risk to privacy. Future CES research using computer vision would be best served by a public repository of images and CNN models to facilitate model building with common training data and benchmarks.
Furthermore, such a repository could address ethical and privacy concerns if images were crowd-sourced from individuals who voluntarily submitted content with explicit permissions and restrictions on the uses of those images (Mashhadi et al., 2021). Individuals could optionally self-report their recreational activities, along with other information about their experience and interactions with the environment. This public repository would be invaluable for researchers applying computer vision to questions about CES.

| Technological advances
By developing an open-source and reproducible approach, this study overcomes several limitations of previous studies that used proprietary tools such as Google Vision and Clarifai (e.g. Egarter Vigl et al., 2021; Lee et al., 2019; Richards & Tunçer, 2018). Among the many issues with proprietary tools are that researchers do not know how closed-source models are constructed, how they are trained, when they change, how long they will be available, and at what price. In contrast, an open-source model facilitates science that is reproducible, testable, improvable and accessible.
The CNN model weights that we have provided could be applied in research on these activities in other regions, though we also recommend that future studies expand and improve the CNN by retraining it with additional images that capture a broader range of situations. Furthermore, a community of practice focused on the goal of testing and training models for recognizing different types of recreational activities could ultimately create more accurate and trusted tools for management compared to proprietary tools. Freely available tools that are properly evaluated, improved and run by the community could additionally build capacity within organizations that would otherwise be unable to afford commercial solutions.

| CONCLUSION
Together, computer vision and volunteered geographical information from social media have the potential to help overcome challenges to conceptualizing and measuring CES over large geographies. It is important to recognize that these techniques can suffer from biases created during both the data generation and analysis stages of a study, illustrating the danger of relying on unvalidated data sources and models when drawing conclusions about visitors to public lands.
Yet, we conclude that carefully applying AI to user-generated content allows researchers and practitioners to learn about patterns of recreational activities across larger spatial scales than would otherwise be feasible. In our study region, by leveraging this technique, we were able to demonstrate that a greater diversity of activities occurs in parts of the landscape which exhibit certain natural and built features. We believe that this study and our open-source image classifier are important steps towards creating reproducible and actionable information about recreation, offering researchers a new tool to characterize one type of CES at scales which are relevant to environmental decision-making.

... developed our classifier using photographs from the Middle Fork Snoqualmie River Valley (Middle Fork), which is a popular recreation destination located approximately 50 km east of the Seattle metropolitan area. The Valley contains lush coniferous forest in the foothills of the Cascade Range, ranging from 150 to 1850 m above sea level. Due to its low elevation and temperate maritime climate, the valley floor and access roads remain snow-free year-round. The Middle Fork Snoqualmie River is a designated Wild and Scenic River, and approximately half of the study region is in the Alpine Lakes Wilderness. The area provides a variety of recreational opportunities, ranging from trails for hiking, biking and horseback riding to campgrounds and day-use picnic sites. Vehicle access to this valley is via a single road, making it a relatively self-contained recreation destination. We tested our classifier in the Mountain Loop Highway corridor (Mountain Loop) region of the same National Forest, located approximately 75 km northeast of Seattle, WA. This area is another common destination for visitors, though the amount and character of recreational use differ between the two regions. The Mountain Loop region has steeper terrain and it includes both developed and informal campsites. As a result, it tends to draw individuals interested in physical challenge as well as a larger number of overnight visitors. Portions of the region are snow covered and inaccessible during the winter months. We created spatially explicit boundaries around each region based on access. The Middle Fork boundary encompassed just over 450 km² around recreation sites primarily accessed via the Middle Fork Road. The Mountain Loop boundary included almost 1450 km², with recreation sites primarily accessed from the Mountain Loop Highway or from State Route 530.
FIGURE 1 Map of the Middle Fork and Mountain Loop study regions in western Washington, USA, showing the location of all geotagged Flickr photographs posted within the boundaries between 2005 and 2018.

... Fork and Mountain Loop regions between 2005 and 2018 by querying the Flickr application programming interface (API) in December 2018–January 2019. This resulted in 14,839 photographs from the Middle Fork region (from 688 unique users, max. 5940 photos from a single user) and 18,350 photos in the Mountain Loop region (from 863 unique users, max. 681 photos from a single user).

... belonged to an activity class and were correctly classified as belonging to that class), false positives (FP, images which did not belong to an activity class but were incorrectly assigned to that class), false negatives (FN, images which belonged to an activity class but were incorrectly assigned to a different class) and true negatives (TN, images which did not belong to an activity class and were correctly not assigned to that class). Precision is the positive predictive value ($\frac{TP}{TP + FP}$), or how many of the samples predicted as positive are actually positive. It can be interpreted, for example, as 'How many of the images labelled hiking are actually images of hiking?'. Recall is the true-positive rate ($\frac{TP}{TP + FN}$), also known as sensitivity, and represents how many of the positive samples are captured by the positive predictions. It answers the question 'How many of the hiking images are actually captured in the hiking class predictions?' The $F_1$ score is the harmonic mean of precision and recall, creating a single metric of class performance. The harmonic mean gives greater weight to lower values, so classes will only have a high $F_1$ score if both precision and recall are high. We chose not to focus on specificity, the rate at which a model correctly detects that an image does not belong to a particular class ($\frac{TN}{TN + FP}$, true-negative rate), because metrics which use TN tend to be less insightful with unbalanced datasets. This is because minor classes will tend to have very large numbers of true negatives, since most of the images belong to other classes. We summarized overall model performance by calculating accuracy, macro $F_1$ score and Cohen's unweighted Kappa. Accuracy is the proportion of all predictions that are correct. The macro $F_1$ score is the average of each individual class' $F_1$ score, giving equal weight to every class. Finally, to confirm whether the model performs better than an uninformed guess, we measured the model's performance using Cohen's unweighted Kappa and carried out a one-sided test of the overall accuracy, which compared the overall accuracy to the rate of the largest class (Kuhn et al., 2008).

... images (Figure 2, Table 2). It classified images into recreational activities better than would have been expected by chance (one-sided overall accuracy p < 0.0001), with large variability in the model's performance across activity classes (Table ...
FIGURE 2 Confusion matrix for top-1 predictions in the Middle Fork region using photographs randomly sampled with stratification (n = 742).

...ing spatial variability in nature-based recreational activities based on content in social media images. Our CNN model is able to classify images according to 12 distinct recreational activities and two no activity classes. This is, to our knowledge, the first open-source solution to be evaluated for this particular classification problem. While demonstrating that it is possible to identify individual recreational activities in images, we also show that there are apparent biases in ... choose to photograph and post to social media. Two popular recreational activities (fishing and trail running) in our study area in western Washington, for example, are almost entirely absent from Flickr photographs taken within the region. Yet,
FIGURE 4 The number of participants in each of 12 recreational activities across the entire Middle Fork study region. Colours represent the $F_1$ score for each class based on the manual evaluation in the region. Classes without an $F_1$ score are shown in grey. The dashed line shows the relationship between surveys and APUD for activities with an $F_1$ score, with the light grey ribbon showing the 95% confidence interval.
FIGURE 5 Maps showing the number of distinct activities by 2 km grid cell across the Middle Fork (a) and the Mountain Loop (b) regions, according to Flickr images classified by the CNN model. Study region boundaries are shown in white.

If differences are due to prediction errors by the CNN model, then the approach would benefit from additional training data in the form of labelled images. While we found no significant relationship between $F_1$ scores and the number of training images, three classes were so uncommon in the Flickr images that we were unable to calculate scores for them. Future studies working at sites with higher visitation and thus more photographs, or incorporating more popular social media platforms such as Instagram, will be able to improve our open-source classifier by retraining the CNN model. To support CNN model development, and to advance the use of image recognition in CES research more broadly, future studies should report common metrics for evaluating performance, identifying data imbalances and guiding future training. In this study, we include our confusion matrices and report three standard metrics for machine learning: namely $F_1$ score, precision and recall. These key statistics facilitate open research and development of tools that are transparent, understandable and improvable.
FIGURE 6 Coefficients from the model relating diversity of activities to underlying landscape features in two study regions (n = 554, pseudo-R² = 0.22). Coefficients which are insignificant at α = 0.05 are partially transparent.
... more broadly provide an example of how to build, test and apply AI to understand recreation and other types of CES.
convolutional neural network, cultural ecosystem services, environmental management, image recognition, machine learning, open source, recreational activities, social media
... 2021; Ruths & Pfeffer, 2014). Without comparing results from AI models to results from other methods of elucidating CES values such as public participatory geographic information systems (PPGIS) ...
... the training on a workstation with an eight-core CPU (AMD Ryzen 2700x) and 64 GB memory. The classification and prediction Python scripts were based on TensorFlow and Keras APIs (Chollet, 2015; Martín Abadi et al., 2015) running on a single NVIDIA ...
TABLE 2 Model performance for top-1 predictions in the Middle Fork region using photographs randomly sampled with stratification (n = 742). The Support column gives the number of test photographs which truly belong to each activity, not the number of photographs which were selected for manual evaluation.
Model performance by activity class in the Middle Fork (accuracy = 0.71) and in the Mountain Loop (accuracy = 0.60) regions. Activity classes are ordered by their $F_1$ score in the Middle Fork region; values are provided in Table 2 and Table S3. Minor classes horseback riding and trail running have no values as none of the test images were correctly classified into these categories. Fishing was evaluated only in the Mountain Loop region (n = 1).
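The one-sided test of overall accuracy against the rate of the largest class, mentioned in the evaluation text above, can be sketched as an exact binomial tail probability. This is an illustration under stated assumptions rather than the study's implementation: the function name is hypothetical, the example counts are invented, and the direct summation is only practical for moderate sample sizes.

```python
from math import comb


def accuracy_p_value(n_correct, n_total, no_info_rate):
    """One-sided exact binomial test: P(X >= n_correct).

    X ~ Binomial(n_total, no_info_rate), where no_info_rate is the share
    of the largest class, i.e. the accuracy of always guessing that class.
    """
    return sum(
        comb(n_total, k) * no_info_rate ** k * (1 - no_info_rate) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Hypothetical counts: 8 of 10 images correct, largest class covers 50%.
print(accuracy_p_value(8, 10, 0.5))  # 0.0546875
```

A small p-value indicates that the observed accuracy is unlikely to arise from always predicting the most common class, which is the sense in which a classifier performs better than an uninformed guess.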