This week it was a bit more difficult to communicate with my teammate, since our fall semester has now started. We created an outline for our final paper and decided who was going to write each section. As I was writing down my parts, I realized it would not be possible to finish the paper in a week, so I talked to my mentor about it. As soon as we complete our first draft, we will share it with our mentor to receive feedback and make the necessary changes.
As we approach the end of our program, we are wrapping up our project and creating the last visualizations. Using the predicted values from the WETA and covariates dataframe, we visualized these values by plotting them according to latitude and longitude, which gave us a map of the state of Oregon. Our predictions showed that it is more likely to detect the species in the southwest part of Oregon, which is surrounded by national forests (Siuslaw, Willamette, Umpqua). Our dataset contained 9,170 observations. Following the same steps we took in week 8, we used data from our mentor Eugene covering all parts of Oregon and visualized the predictions for that data as well. Even with this new data, the southwest remains the best place to detect the species.
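The mapping step can be sketched roughly as follows; the data frame and column names (`preds`, `occ_prob`) are placeholders, and random values stand in for real model output:

```r
# A minimal sketch of plotting predictions by coordinates.
# `preds` stands in for the real prediction data frame; the
# column names and random values are made up for illustration.
set.seed(8)
preds <- data.frame(
  longitude = runif(200, -124.5, -116.5),  # rough Oregon extent
  latitude  = runif(200, 42.0, 46.3),
  occ_prob  = runif(200)                   # predicted occupancy probability
)
# darker points = higher predicted probability
plot(preds$longitude, preds$latitude,
     col = gray(1 - preds$occ_prob), pch = 16,
     xlab = "Longitude", ylab = "Latitude",
     main = "Predicted occupancy by location")
```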
Sites were set according to the latitude-longitude of each observation; if two observations had the same latitude-longitude, they received the same site ID. To do so, we used the function unique() to remove all duplicated latitude-longitude pairs from the WETA-covariates dataframe. After doing this, we numbered all unique locations and merged this dataframe back with the WETA-covariates dataframe. With this, we completed our second task of creating a site column. Next, we decided to train our data. We split the data randomly into 80%-20% and used the 80% portion to make predictions. Because we decided to only use unique sites, we removed all duplicated sites; however, this brought up one challenge. The csvToUMF() function from the unmarked package expects at least two observation columns, and we only had one. At first we thought we could merge the duplicated sites to obtain multiple observations per site; however, each duplicated site varies in the number of duplicates it has: while some repeat only twice, others repeat more than 20 times. Our approach was to create an empty observation column to avoid generating simulated data. We also realized that the function only needs the necessary variable columns; what I mean by this is that our training data originally had about 41 columns, of which we would only use 10 to get predictions. So we deleted the 31 columns that were not needed, and finally we were able to get some estimated predictors. Both my teammate and I are starting our next semester of college next week, so we are wrapping up our project and starting to write a final paper, which will be shared with one of our mentors, Mark Roth, and will hopefully give him good insight into our findings and help him continue with this project.
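The site-ID construction and the train/test split can be sketched like this (the toy data frame and its column names are hypothetical):

```r
# Toy stand-in for the WETA-covariates data frame; columns are hypothetical
weta <- data.frame(
  latitude  = c(44.1, 44.1, 43.5, 42.9),
  longitude = c(-123.2, -123.2, -122.8, -121.0),
  y         = c(1, 0, 1, 0)
)
# one row per unique latitude-longitude pair, numbered as the site ID
sites <- unique(weta[, c("latitude", "longitude")])
sites$site <- seq_len(nrow(sites))
# merging back: observations at the same coordinates share a site ID
weta <- merge(weta, sites, by = c("latitude", "longitude"))

# random 80%-20% train/test split
set.seed(1)
idx   <- sample(nrow(weta), size = round(0.8 * nrow(weta)))
train <- weta[idx, ]
test  <- weta[-idx, ]
```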
This week we decided to test how the estimates look when we increase the number of sites and reduce the number of visits. We started with 200 sites and 16 visits and ended with 1,600 sites and 2 visits. We found that detection seems to be best with 400 sites and 8 visits; moreover, detection is not good with only two visits. For occupancy, 800 sites and 4 visits was the best-case scenario. We realized there isn't a pattern where occupancy and detection are both best at the same number of sites and visits. This week we also moved on from simulated data and started using the eBird data. The first task was to merge two files into one: the WETA file and the covariates file. To do this, we tested which variable would be best for joining the files. Our options were: locality_id, checklist_id, and latitude/longitude. With locality_id, when we merged the WETA file (13,010 observations after filtering) and the covariates file (9,170 observations), we ended up with a file of 1,097,769 rows with 5,256 unique locality_ids after filtering. With checklist_id, we ended up with a file of 9,170 rows after filtering, so we decided this would be the data to use. This week we want to focus on figuring out what we want to measure about the results. To do this, we first decided to explore the data by visualizing it in order to understand it. Our next task is figuring out how we want to group observations into sites. Do we want each observation to have its own unique site, do we want to follow the eBird best practices paper and make sites based on observer ID, lat/long, and number of checklists, or do we want to create sites based on latitude and longitude alone?
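The join we settled on can be illustrated with toy versions of the two files (all column names besides checklist_id are made up):

```r
# Toy stand-ins for the WETA and covariates files
weta <- data.frame(checklist_id = c("S1", "S2", "S3"),
                   species_observed = c(1, 0, 1))
covs <- data.frame(checklist_id = c("S2", "S3", "S4"),
                   elevation = c(120, 340, 560))
# an inner join on checklist_id keeps one row per shared checklist,
# instead of the many-to-many blow-up we saw with locality_id
merged <- merge(weta, covs, by = "checklist_id")
nrow(merged)  # 2 shared checklists in this toy example
```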
This week with our second mentor, Eugene, we reviewed our last two experiments. In a shared Google document we made a summary of each experiment and what we learned from it.

**Trade-off between site # and visit #**

**Experiment Design**
- Assuming closure holds, what trade-offs should we consider between the # of sites and the # of visits per site?
- We used one generated ground truth dataset to repeatedly change the # of visits and sites by partitioning the data.
- We start with M sites and 16 visits; we then double the number of sites (2M) and halve the number of visits, to ensure the number of ‘datapoints’ stays the same even while we change how the ‘datapoints’ are grouped.
- Simulation settings:
  - 200 sites / 2 visits
  - 100 sites / 4 visits
  - 50 sites / 8 visits
  - 25 sites / 16 visits
- We will run with these 4 simulation settings, with 10 replications of the experiment, and with 4 different values of M.
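A rough sketch of how such a partition could work (our actual code differs; the detection matrix here is randomly generated):

```r
# Generate a full 200-site x 16-visit detection matrix (toy ground truth)
set.seed(42)
y_full <- matrix(rbinom(200 * 16, size = 1, prob = 0.3),
                 nrow = 200, ncol = 16)

# take the first n_sites rows and the first n_visits columns
partition <- function(y, n_sites, n_visits) {
  y[seq_len(n_sites), seq_len(n_visits), drop = FALSE]
}

settings <- list(c(200, 2), c(100, 4), c(50, 8), c(25, 16))
parts <- lapply(settings, function(s) partition(y_full, s[1], s[2]))
# every setting keeps the same number of 'datapoints': 400
sapply(parts, length)
```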
This week my teammate and I worked in separate ways to create a different simulation. This time we wanted to fit an occupancy model to a “wrong” dataset to compare the estimates against those from correct datasets. For this simulation, I worked on creating a function that would create the “wrong” dataset. To explain what the function does, I will use an example. Let’s say we have a dataset with 2 covariates (elevation and temperature), 100 sites, and 4 visits. The dataframe would be structured in the following way: site (1:100) - y.1 - y.2 - y.3 - y.4 - temperature - elevation - date.1 - date.2 - date.3 - date.4. What my function does is cut the number of sites in half and double the number of visits; the covariate values are calculated by taking the mean of the two merged sites. So after passing through the function, the dataframe would look like: site (1:50) - y.1 - y.2 - y.3 - y.4 - y.5 - y.6 - y.7 - y.8 - temperature - elevation - date.1 - date.2 - date.3 - date.4 - date.5 - date.6 - date.7 - date.8. After running this simulation we encountered some interesting results. First, for detection there is a big difference in error between using the correct and the wrong dataset: by merging two sites into one with, say, (1,1,0,0) observations, the model will incorrectly estimate the parameters that link site features to occupancy status. Second, to get a better estimate for occupancy we must use fewer sites: for example, between 200 sites/2 visits and 100 sites/4 visits, the estimate was much better with 100 sites and 4 visits for both the correct and the wrong datasets. The key difference between the two simulation experiments is that the first examines the trade-off between site # and visit # by slicing the generated ground truth data, while the second makes the trade-off by merging two true sites into one site (conducting wrong site clustering).
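A simplified sketch of what my function does (names and data here are made up, and the real function also handles the date columns):

```r
# Merge pairs of sites: halve the sites, double the visits,
# and average the covariates of each merged pair
make_wrong <- function(y, covs) {
  odd  <- seq(1, nrow(y), by = 2)
  even <- seq(2, nrow(y), by = 2)
  list(
    # detection histories of the two sites are placed side by side
    y    = cbind(y[odd, , drop = FALSE], y[even, , drop = FALSE]),
    # covariates of the two merged sites are averaged
    covs = (covs[odd, , drop = FALSE] + covs[even, , drop = FALSE]) / 2
  )
}

set.seed(7)
y     <- matrix(rbinom(100 * 4, 1, 0.4), nrow = 100)  # 100 sites, 4 visits
covs  <- data.frame(temperature = rnorm(100), elevation = rnorm(100))
wrong <- make_wrong(y, covs)
dim(wrong$y)  # now 50 sites with 8 visits each
```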
This week we worked on answering our research question: assuming closure holds, what trade-offs should we consider about the number of sites/visits? To answer this question, we created a simulation using our previous code from week 2 to generate data and manipulate it. Our goal was to generate one dataset and reuse it multiple times by changing the number of sites and visits on each run. The biggest challenge we encountered was learning how to divide the data while keeping it correctly formatted and without changing it. We originally created a big dataset with 200 sites and 16 visits. We then created a function that would partition our data into the following configurations: 200 sites and 2 visits, 100 sites and 4 visits, 50 sites and 8 visits, and 25 sites and 16 visits. At first I thought our 4th configuration (25 sites and 16 visits) would have the smallest RMSE value for both detection and occupancy: the more recorded visits, the better the occupancy model's estimates would be. To my surprise, the results were a bit different. For occupancy, the best model had 200 sites and 2 visits, while for detection the best model had 100 sites and 4 visits. Our next step is to test whether these results change when we add more covariates to the different models.
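The RMSE comparison itself is simple; a sketch with made-up numbers:

```r
# RMSE between the true parameter value and a set of estimates
rmse <- function(truth, est) sqrt(mean((est - truth)^2))

true_psi <- 0.6                        # hypothetical true occupancy
est_psi  <- c(0.55, 0.62, 0.58, 0.71)  # hypothetical estimates per run
rmse(true_psi, est_psi)                # smaller = better estimates
```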
This week my teammate, Demetrius, and I worked on generating data to create occupancy models, so we could start getting a better understanding of how to deal with data and how to work with the functions included in the unmarked package. Our goal was to answer the following question: are the parameters better with 100 sites and 2 visits, or 150 sites and 3 visits? At first, we had difficulty making our code work, since every time we tried to run it we would encounter an error: “R Session Aborted - R encountered a fatal error. The session was terminated.” We were able to get past this error; however, we are still not sure what caused it. Once we had our data generated correctly and had created our occupancy models, we learned about Root Mean Square Error (RMSE) and applied it to our models: the smaller the RMSE, the better the estimates. We also wanted to improve our code so we could run it multiple times with different variables without having to change many lines, so we decided to wrap our code in a loop. After seeing how much we struggled to get this first task working properly, we discussed wanting to improve our code a bit more, make it available to others, and ease the struggles they may encounter. Our last task of the week was to review the code of one of our mentors, Mark Roth, in which he created an occupancy model using eBird data. This code was similar to what we did during the week, but more complex. By the end of the week, our mentor Professor Hutchinson told us to think about which topics we were most interested in, to continue working on them during the rest of the summer.
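A minimal version of the generate-and-fit step, assuming the unmarked package is installed (the parameter values below are made up):

```r
library(unmarked)  # install.packages("unmarked") if needed

set.seed(1)
n_sites <- 100; n_visits <- 2
psi <- 0.5; p <- 0.4                     # made-up true occupancy / detection
z <- rbinom(n_sites, 1, psi)             # latent occupancy state per site
y <- matrix(rbinom(n_sites * n_visits, 1, p),
            nrow = n_sites) * z          # detections only at occupied sites

umf <- unmarkedFrameOccu(y = y)          # the package's occupancy data format
fm  <- occu(~ 1 ~ 1, data = umf)         # intercept-only occupancy model
plogis(coef(fm))                         # estimates on the probability scale
```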
After our long holiday weekend, we resumed our readings by finishing our analysis of Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One by Darryl I. MacKenzie. This time, our mentor explained to us how covariates are added to the occupancy formula, and how it varies depending on the number of covariates. Our second task of the week was following the Overview of unmarked documentation (found here: https://cran.r-project.org/web/packages/unmarked/vignettes/unmarked.pdf ) to start understanding how to create occupancy models in R. The unmarked package can be installed from within RStudio, and it includes various methods that help the user estimate site occupancy, abundance, and density of animals. The package includes a dataset that can be used for practice before using real data. I tried to find information about the dataset included in the package in order to understand the numbers I was getting from the formulas in the documentation. Unfortunately, I could not find anything that helped me understand them better. Our mentor gave us the task of creating our own data to help us understand better.
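The way covariates enter the occupancy formula can be sketched numerically; the coefficients below are made up:

```r
# logit(psi_i) = b0 + b1 * elevation_i + b2 * forest_i
# each additional covariate adds one more term to the linear predictor
b <- c(b0 = -0.5, b1 = 0.8, b2 = 1.2)  # hypothetical coefficients
elevation <- c(-1, 0, 1)               # standardized covariate values
forest    <- c(0.2, 0.5, 0.9)
psi <- plogis(b["b0"] + b["b1"] * elevation + b["b2"] * forest)
round(psi, 2)                          # occupancy probability per site
```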
During this week, I was given multiple papers related to the project that my peer, Demetrius Hernandez, and I will be working on during the summer. The first paper we were given is named On the Role of Spatial Clustering Algorithms in Building Species Distribution Models from Community Science Data by Mark Roth. The link to his presentation about this project can be found here: https://recorder-v3.slideslive.com/?share=42219&s=5bd64fd0-d8ce-4c16-b0a3-7c4a68d28a55 This paper gave us an insight into the kind of data we will be working with and the different challenges we may encounter during our research. We were first introduced to occupancy modeling, which “allows [us] to simultaneously estimate the probability that a species occupies a location and the probability that the observer detects the species given that it is present.” This will be the focus of our final project. The paper suggests “defining a site as a set of two to ten checklists (lists of species recorded during one period) submitted by the same observer at the same exact latitude-longitude coordinates” for analysis of the data.