Statistical Exploration

This section describes our initial efforts to use a variety of statistical procedures to analyze the atlas project data. Results are reported in the species accounts. We anticipate continuing this statistical exploration after this CD is released, and welcome collaboration with other volunteers who have the time, software, skills, statistical knowledge, and interest to join us in this analysis.

A major objective of the data analysis has been to identify non-random associations between presence of particular species in a hexagon or square (during any breeding season, 1995-1999), and various measures of observer effort, climate, and habitat type (including habitat diversity and pattern). The data analysis cannot establish cause-effect relationships between these variables and the presence or absence of a species, but a thoughtful analysis can identify useful indicators and empirical rules for predicting species presence or absence at a regional scale. The data analysis strategy was developed by Paul Adamus in consultation with Fred Ramsey (Department of Statistics, Oregon State University), and was implemented by Carolyn Krueger using the NCSS software for Logistic Regression, and the KnowledgeStudio software for Classification Tree Analysis. Paul Adamus also used the NCSS software to implement the Stepwise Regression analysis, and wrote the narratives describing and interpreting the results from these three statistical procedures.

The following narrative briefly describes the three statistical procedures and how we implemented them with this particular data set. This narrative is written primarily for persons with a working knowledge of formal statistical analysis procedures. A good technical description of Logistic Regression can be found in the book, The Statistical Sleuth, by Fred Ramsey and Dan Schafer.

Logistic regression is a procedure that identifies the combination of independent variables that best explains variation in a binary dependent variable. In this case, the binary dependent variable was a species observation, which was coded as 1 (observed, at any level of breeding confirmation) or 0 (not observed) for each species, for each hexagon and for each square. Logistic regression, rather than a simple correlation analysis or a nonparametric two-sample test (e.g., Mann-Whitney U), was used because logistic regression is a non-linear procedure that is less affected by non-normality in the statistical distributions of the independent variables. No transformations were applied to any variables analyzed by the logistic regression. Three tasks were performed using NCSS's forward hierarchical logistic regression analysis:

1. The binary species response was regressed against 74 climate variables (the PRISMHX file), then against 61 habitat type variables (the HABHEX file), and then against 14 observer effort variables (the EFFORTHX file) (see data files). We did not specify any interaction terms among the independent variables in any of these analyses. In each group of independent variables, all variables with Wald-Z values less than -4 or greater than +4 were identified as "highly significant," and those with Wald-Z values of -2 to -4 or +2 to +4 were termed "significant." This analysis was run twice, first including data from all 430 hexagons, and then including just the hexagons that contained at least 41% land area in Oregon (a separate analysis had determined the 41% threshold to be a meaningful break point). Variables with significant Wald-Z values from either analysis were brought forward. If the set of variables that were significant for a particular species did not include any climate variables, some non-significant (by our definition) climate variables were brought forward for that species, so that all subsequent analyses would include both habitat and climate variables. These "non-significant" climate variables in most cases had p values of <0.05.

2. A second round of logistic regression was run using habitat variables derived from a different source. The alternative 18 habitat variables were mostly from a computer-supervised interpretation of AVHRR satellite imagery, as opposed to the 61 habitat variables used in the above analysis which were from the ONHP's interpretation of Thematic Mapper satellite imagery. The logistic regression of the alternative habitat data set was run once including all hexagons, and then again using just the hexagons that contained at least 41% land area in Oregon, and variables with significant Wald-Z values from either analysis were brought forward.

3. Finally, logistic regression was performed on data from the squares in a similar manner. Squares surveyed for 3 or fewer hours, or with incomplete habitat or climate data, were excluded. As before, all the variables with significant Wald-Z values, plus a few with "non-significant" climate variables if necessary, were brought forward. Data were unavailable for a few of the many climate variables used in the hexagon analyses.
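
The Wald-Z screening described in the three tasks above can be sketched for a single predictor as follows. This is only a minimal illustration in Python, not the NCSS procedure: the function names, the Newton-Raphson fitting routine, and the sample values are assumptions made for the example, and the actual analysis was a forward hierarchical regression over many variables at once.

```python
import math

def _score_and_information(b0, b1, x, y):
    """Return the gradient (g0, g1) and information matrix entries (h00, h01, h11)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # predicted probability of presence
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Returns (b0, b1, wald_z), where wald_z = b1 / standard error of b1.
    Assumes the data are not perfectly separable by x.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Variance of b1 is the lower-right entry of the inverse information matrix
    _, _, h00, h01, h11 = _score_and_information(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b0, b1, b1 / se_b1

def classify_wald_z(z):
    """Apply the thresholds stated in the text to a Wald-Z value."""
    if abs(z) > 4:
        return "highly significant"
    if abs(z) >= 2:
        return "significant"
    return "not significant"

# Hypothetical data: one climate variable (arbitrary units) for eight atlas
# units, and presence (1) or absence (0) of a species in each unit.
precip = [1, 2, 3, 4, 5, 6, 7, 8]
present = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, z = fit_logistic(precip, present)
print(round(b1, 3), round(z, 3), classify_wald_z(z))
```

With real atlas data, each of the climate, habitat, and effort variables would yield its own Wald-Z value, and the labels returned by classify_wald_z would determine which variables are brought forward.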

Classification tree analysis is a type of procedure that identifies the hierarchical and interactive combination of independent variables across specified numeric ranges that best explains bird species presence or absence. The tool does not assume linearity in relationships among variables, nor a statistically normal distribution of the data. It examines statistical relationships between avian variables and potentially explanatory variables not only among the entire set of hexagons (as does logistic regression) but also among subsets of hexagons.

We applied pre-specified stopping rules in the analysis, rather than use time-consuming iterative cross-validations to "prune" the classification tree down to some optimal number of explanatory variables. Specifically, and for simplicity of interpretation, we specified that no more than 4 levels be generated in a hierarchy. Initial testing indicated that seldom were more levels justified, i.e., allowance of more levels resulted in splits that contained fewer than 5 members (hexagons) and thus probably had little predictive power.

We implemented the classification tree analysis using a data-mining classification procedure featured in the software program KnowledgeStudio (Angoss Software, Inc.). For each species, 3 classification trees were generated (the tree diagrams are accessible from the Anlys link in the species overview window). As with the logistic regression, the first round of analysis used the ONHP habitat data for the hexagons, the second used the AVHRR interpretation of habitat, and the third featured data from the squares. In each case and for every species, in addition to analyzing all variables we had analyzed in the logistic regression, we included decimal latitude and longitude, and annual precipitation.

The portion of the KnowledgeStudio software that we used implements what is known as a CHAID algorithm. CHAID stands for Chi-squared Automatic Interaction Detection, and (as its name implies) uses chi-squared tests for statistical association between dependent and independent variables. Although the CHAID algorithm has been used widely by data mining companies in the analysis of socioeconomic data, the results presented herein may represent the first time the CHAID algorithm has been used for analysis of ecological data. More often, a similar algorithm (termed CART, for Classification and Regression Tree) has been applied to ecological data sets. However, CART is limited in that it allows each node to be split into no more than two categories. For example, CART might split a species' geographic range into 2 categories ("wet" and "dry" regions), then split these further into 4 regions (wet-cool, wet-warm, dry-cool, dry-warm). In contrast, CHAID might create an initial 3-way split (wet, intermediate, dry) and subsequently split each of these categories into 2 or more subcategories. On each level, CHAID's decision as to how many categories to create depends on how many it determines to be statistically supportable.
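
The chi-squared test on which a CHAID-style split rests can be illustrated with a short Python sketch. The function name and the hexagon counts below are hypothetical, and the sketch computes only the Pearson chi-squared statistic for one candidate split; it does not reproduce the category merging or significance adjustments of a full CHAID implementation such as the one in KnowledgeStudio.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if presence were unrelated to the split
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts of hexagons for a candidate 3-way split on precipitation.
# Columns: wet, intermediate, dry. Rows: species present, species absent.
three_way_split = [
    [40, 25, 5],
    [10, 25, 45],
]
statistic = chi_squared(three_way_split)
degrees_of_freedom = (len(three_way_split) - 1) * (len(three_way_split[0]) - 1)
print(statistic, degrees_of_freedom)
```

A larger statistic, relative to its degrees of freedom, indicates a stronger association between the candidate categories and species presence, which is the kind of evidence used to decide how many categories are statistically supportable.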

For analyzing species-environment relationships, we chose to use these two very different procedures (Logistic Regression, Classification Tree Analysis) in parallel rather than in sequence. Had we used them in sequence, we might first have used one procedure to narrow the large number of variables to a smaller number, and then used the other procedure to identify from the limited set the ones with the strongest statistical association with presence of each species. Instead, we applied both procedures to the full set of variables, and looked for commonalities among variables they independently identified as having a statistically significant association with species occurrence. Results of using different approaches may differ somewhat because of differences in underlying mathematical assumptions of each procedure, e.g., thresholds for statistical significance, differential sensitivity of formulas to particular data distributions. A major structural difference between the two procedures is that logistic regression analyzes the relationship of all independent variables to the dependent variable simultaneously, whereas classification tree analysis examines the relationship sequentially and recursively, i.e., one independent variable at a time. There is no evidence to support one procedure being more "correct" than the other, but in many ways classification trees seem more faithful to the sorts of contingent relationships among variables that exist in nature.

Several avian variables contain continuous (non-binary) values and thus were analyzed with stepwise regression rather than with logistic regression. These include the following dependent variables, calculated for both the hexagons and squares:

These were also analyzed using the classification tree analysis, but the resulting trees are not reproduced on this CD.

Forward stepwise regression was implemented after first applying log10 and arcsine transformations to data where appropriate. Sample sizes (n) were generally 430 for hexagon analyses and 410 for squares. The software ran up to 20 iterations of each regression, using a 0.05 probability to remove independent variables and 0.20 probability to enter, with 0.015 specified as the minimum root mean square error required to remove a variable from the final model. Approximately 66 dependent variables were screened at the hexagon scale and approximately 20 at the square scale.
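
The two transformations mentioned above might look like the following in Python. The text does not give their exact form, so this sketch assumes the arcsine square-root version commonly applied to proportions and a log10 transformation with an added offset of 1 to accommodate zero values; both function names and the offset are assumptions.

```python
import math

def log10_transform(value, offset=1.0):
    """log10(value + offset); the offset of 1 is an assumed way to handle zeros."""
    return math.log10(value + offset)

def arcsine_sqrt_transform(proportion):
    """Arcsine square-root transformation (in radians) of a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return math.asin(math.sqrt(proportion))

# Hypothetical values: a count-like variable and a proportion of a hexagon
# covered by one habitat type.
print(log10_transform(99))
print(arcsine_sqrt_transform(0.5))
```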

The results only address the possible role of habitat measured at a landscape (hexagon or square) scale on species presence. Because many bird species seem to primarily key into habitat measured at a local scale (a few acres rather than 160,000 acres), and because volunteers involved in this project were not required to record the local habitat of each observation they reported, the results of our statistical analysis are very approximate. Moreover, distribution of effort was not equal among atlas units, and observation time was not consistently reported. The atlas project was not designed specifically to answer questions about species habitat associations at any scale, and results should not be represented as demonstrating causal relationships between a species' occurrence and the presence of a particular habitat or condition.

The bird observations themselves (the dependent variable) are not a probability sample -- they represent a survey that is geographically complete at one scale (the state), but is neither complete nor probabilistic at another scale (within hexagons) due to access restrictions. These data are also likely to be spatially autocorrelated to a considerable degree, thus biasing estimates of statistical relationships between species, habitat types, and climate.

The independent variables used in this analysis -- those pertaining to weather and habitat -- included nearly all that were available in data sets generously provided to us by other researchers. This opportunistic approach to selecting independent variables for testing may fail to examine all variables potentially important to birds, and probably includes some fairly irrelevant ones. In general, we did not attempt to second-guess the quality of data provided to us by other sources, although it seems likely that some of the variables may have been measured inconsistently across Oregon (e.g., road density). For lists of variables we examined, see Data files.

Because the number of statistical comparisons we examined is large (275 species x 167 variables in hexagons = 45,925 possible combinations), about 1 in 20 of the comparisons could appear statistically significant (at p = 0.05) by chance alone, so some of the associations that we found to be statistically significant might actually be attributable only to chance. Moreover, the independent variables we examined are not necessarily independent of each other, and some are multicollinear (e.g., are subsets of others). For example, the three measures of observer effort are highly correlated, as are many of the vegetation types, which in turn are correlated with many of the climate variables we used. The analytical procedures we used are capable of systematically sorting out these interdependencies to some degree, but some distortion may remain.
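
The arithmetic behind this multiple-comparisons concern can be sketched in Python. The function name is hypothetical, and the calculation assumes the simplest case, in which none of the tested associations is real, so that each comparison has a 0.05 chance of appearing significant.

```python
def expected_chance_findings(n_species, n_variables, alpha):
    """Return (number of comparisons, number expected to look significant by chance)."""
    comparisons = n_species * n_variables
    return comparisons, comparisons * alpha

# Figures from the text: 275 species, 167 hexagon variables, p = 0.05
comparisons, expected = expected_chance_findings(275, 167, 0.05)
print(comparisons, expected)
```

Under that assumption, roughly 2,300 of the 45,925 comparisons would be expected to appear significant even if no real associations existed.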

Due to time constraints, we did not attempt to use scatterplots to examine data distributions of all dependent-independent variable pairings. We also did not attempt to combine any variables -- e.g., big sagebrush and low dwarf sagebrush -- prior to statistical analysis, e.g., through use of cluster analysis, principal components analysis, other statistical techniques, or professional judgment. Doing so might have altered some of the statistical associations reported in our results.

Future analyses of the data might employ other statistical tools to associate particular species with particular environmental variables. These tools include but surely are not limited to:

MRPP (multiresponse permutation procedures), wherein presence and absence of a species are treated as two groups, and p-values associated with various sets of explanatory variables are considered to estimate their relative contribution.
NMDS (non-metric multidimensional scaling), wherein a group of related species are ordinated in two-dimensional space using the Wald coefficients (from logistic regression) of an independent variable that is suspected of influencing the distribution of species in the group.
Mantel test, a procedure that can be applied prior to other analyses to account for the inevitable geographic autocorrelation among records of some variables.