Week 5 July 26th- July30th
This week my teammate and I worked on seperate ways to create a different simulation. This time we wanted to fit an occupancy model into a “wrong” dataset to compare the estimates with correct datasets. For this simulation, I worked on creating a fucntion that would create the “wrong” dataset. To explain what the fucntion did I will use an example. Let’s say we have a dataset with 2 covariates, elevation and temperature, 100 sites and 4 visits. The dataframe would be structured in the following way: site (1:100)- y.1-y.2-y.3-y.4-temperature-elevation- date.1-date.2-date.3-date.4. What my fucntion does is to cut the number of sites into half, and double the number the visits, the covariates values would be calculated by taking the mean of two sites. So after passing the function, the dataframe would look like: site (1:50)- y.1-y.2-y.3-y.4-y.5-y.6-y.7-y.8-temperature-elevation- date.1-date.2-date.3-date.4-date.5-date.6-date.7-date.8. After running this simulation we encounter some interesting results: first: for detection there is a big difference in error between using the correct and wrong dataset. By merging these to one site with (1,1,0,0) observations, the model will incorrectly estimate the parameters that link between site features and occupancy status. second: to get a better estimate for occupancy we must use less sites. For example between 200sites/2visits and 100sites/4sites, the estimate was much better when we had 100 sites and 4 sites with both correct and wrong datasets. The key difference between the two simulation experiments is the first work sees the trade-off between site # and visit # by slicing the ground truth data generated, while the second work makes the trade-off by merging two true sites to one site (conducting wrong site clustering).