poseidon-dev/RethinkSamplingFunctions.Rmd at master · Bai-Li-NOAA/poseidon-dev · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
---
title: "Rethinking sampling functions to include weights and numbers"
author: "Sarah Gaichas"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction: the problem

The original design of [atlantisom](https://github.com/r4atlantis/atlantisom) conceived at the [2015 Atlantis Summit](https://research.csiro.au/atlantis/atlantis-summit/) in Honolulu, Hawaii, USA.had the following workflow:

1. apply `run_truth` to a set of Atlantis model outputs, which returns 'atoutput' components in the following units by species, agecl (Atlantis cohort, often in units of maxage/10), polygon (model box), layer (model depth), and time (model toutinc or outputstep):
    + `biomass_eaten$atoutput` is eaten biomass in t for each functional group?
        + `biomass_eaten$vol` is volume aggregated over layers
        + `biomass_eaten$prey` is the prey item eaten?
        + `biomass_eaten$dietcomp` is proportion in diet?
        + `biomass_eaten$bio_eaten` is different from atoutput but how?
    + `biomass_ages$atoutput` is  biomass in t from `calc_biomass_age`
    + `catch$atoutput` is fishery catch in numbers of individuals
    + `nums$atoutput` is numbers of individuals
    + `resn$atoutput` is Nitrogen reserve weight in mg/m^3
    + `structn$atoutput`is Nitrogen structural weight in mg/m^3
2. additional components of the `run_truth` return (default name `result`) are key factors needed for conversions and interpreting the outputs (described elsewhere):
    + `biolprm`
    + `fgs`
3. apply `create_survey` to the result output of `run_truth`
4. apply `sample_survey_biomass` to the output of `create_survey` to get an index
    + designed to use nums output with an average weight--but estimating average weight proved problematic the way I did it, which was too imprecise
    + can be used directly on biomass_ages output with a vector of average weight=1 (see [True Bio Test](https://sgaichas.github.io/poseidon-dev/TrueBioTest.html))
5. apply `sample_fish` to the output of `create_survey` to generate biological samples
6. apply `calc_age2length` to output of `sample_fish` to generate length compositions
    + *FIRST BIG PROBLEM!* `calc_age2length` needs weight data at the same level of resolution as the nums data to estimate lengths using the length-weight relationship.
    + Weight data comes from resn and structn. `calc_age2length` was originally tested with completely disaggregated outputs of `run_truth` and it generates length samples correctly that way
    + Initial `calc_age2length` testing with a real dataset revealed that it is entirely too slow to apply at the level of `run_truth` results for a full Atlantis model run and over many species (TrueLengthCompTest.Rmd)
    + Attempts to aggregate structn and resn results to the `create_survey` and `sample_fish` level were incorrect, because:
    + *we cannot apply functions designed for numbers output that aggregate based on sum to resn and structn outputs which are in units of N per volume*
    + So actually, this problem reaches all the way back to `create_survey` which is designed for nums output and aggregates using a sum:

```{r createsurvey}

knitr::include_graphics("https://dm2301files.storage.live.com/y4mSPipdQ7jyZNNfmpslxt7sCy7DqcQ0wMM40iLU1LLwfprFtKrjNJtHdibguUzNb9qY2vqluaHtpnKsUqkJ_wk3PpGCBtZmLGMdDSF1TVPsjeRWwtGjx8AIdNDiw9xvGmXf4aMnjXJ9DK9QW63JRZusH2uKgO1Mjd-8AEY554sLUG9xC9NFGaktffG2Ep5y8sz09stoHEM2pqG1I_oNtHGbg/CreateSurvey.png?psid=1&width=960&height=720")

```


The next three biological sampling steps after creating the length composition are taking a subset of the `sample_fish` for lengths using `sample_length` (though this could be 100% sample as is often done on a survey), taking a subset of the `sample_fish` for ages and applying ageing error using `sample_ages`, and finally taking a sample of individual fish weights, perhaps the same sample that was aged as is done on some surveys) using `sample_weight`. (So far only `sample_ages` is drafted and the other two functions remain to be written.)

## May 15: True age estimation needs to happen in here?

The sequence `create_survey` to `sample_fish` to `calc_age2length` now works. It can go as pictured below if there are not multiple true ages in an age class. Revisions to work on:

1. For clarity, `calc_age2length` should be renamed `calc_stage2length` if we are planning to use it directly between Atlantis cohorts and lengths as tested here.
2. Between `sample_fish` and `sample_ages` we need the already written `calc_stage2age` function, which calls `calc_Z`. Test these next. But see below.
3. Similarly, between `sample_fish` and `sample_weight` we will need to expand to true ages for a weight-at-age vector (for species where stage=age, this is already output as muweight from the `calc_(st)age2length` function).

BUT `calc_stage2age` is designed to work on true outputs (it makes no sense to expand age classes from survey-selectivity affected output). In theory this should be integrated into `run_truth` but I am wary of introducing this approximation so far upstream. On the other hand, what choice to we have if we want true age outputs? We either need the direct output from the most recent codebase(s) or we have to do this approximation to get age input to stock assessment models.

So, lets try integrating `calc_stage2age` into `run_truth` as planned and we can compare the outputs to what we already have for the stage data. We will want to compare census biomass, survey biomass, length comps...

    + write another truth function that works on the Atlantis true age output?
    + load YOY file into run_truth (uses load_YOY function)
    + do we have to apply before calc_biomass_age, and sub output of calc_stage2age in for nums?
    + if so, how do we deal with the structn and resn portions?

    May 16: Discussion with Isaac and Patrick: how to split weights to true age bins?
    + Simple interpolation? within a year or across years
    + Is Z calc lagged? doesnt look like it, simple within year Z like catch curve
    + Check whether there is a plus group in any Atlantis model--shouldn't be
    + First try a linear interpolation between weight groups
    + This may be more wrong for smallest fastest growing age classes
    + If so we can write something else for younger classes
    + Midpoint of weight for each age is the goal

Then we need to revise `calc_age2length` to work for true ages, as it is now hardcoded to expext 10 age classes.

If we do those two things, this sampling flow remains the same, but has `calc_age2length` as the intermediate step before lengths and weights, which will come out from the true age classes.

```{r samplingworkflow}

knitr::include_graphics("https://dm2301files.storage.live.com/y4mV1y2g-JoSJFProk32x9SjWofLHwF_WtQc4Ipz8M2ZLc7hypg4yTYAgvB0IjwVLfjVw4fCNFP9l6lbWQIFI8NtzaIHjxgGEmptz9I9cB5jA5VUCJ_gJmSYSfI8_au29U_nuP5-zw-qkBBINifEo8RlJSvOEol5qu5IfI4BLk5YjjxTomMcbDVtOUBIJQHMEZjRUnlPhy7lV4x5ckvz5WfWA/SamplingWorkflow.png?psid=1&width=960&height=720")

```

## Proposed solution (now done for calc_age2length)

I think we should keep the workflow designed at the Atlantis summit because this is the way surveys generally operate.

We need a way to appropriately aggregate the resn and structn outputs which are densitites. A mean or median across layers and polygons would be appropriate. The `run_truth` function calls `calc_biomass_age` which uses median to aggregate these outputs to generate the biomass_ages result component.

We can either rewrite `aggregateData` to take another argument indicating aggregation by sum or median, or create separate functions `aggregateDataSum` and `aggregateDataMedian` to be used on the numbers/biomass or density outputs, respectively.  Either way we need to ensure that the appropriate aggregation function is used in `create_survey` and `sample_fish`

Current `aggregateData` function, which is strangely not called by `sample_fish`. Or not so strange, since `aggregateData` defaults to aggregate only over layer and maintain polygons in the dataset?

```{r aggregateData}

knitr::include_graphics("https://dm2301files.storage.live.com/y4mlnKwEPKB-fwO50OZng5CBgAwpz_T3Wc_K8BEdyLLAt-WnSs8Qd4YJY1i4hO0Zja9F6Vx1X3WiIDB1pFlBVW9YQDfKGvczcddyoZbl_bvUmPOsdGN4eHiAgyh_rKZ_Op286QWAPURty0QsJsNOhUNzBAc8326d8cKW9QmTlsncAJDDoU5wMdJgWRoalD939BG4Pg2SbyLxBe9tLfYm1xD_g/aggregateData.png?psid=1&width=960&height=720")

```

While we are at it, catch biomass is apparently calculated by `run_truth` but not exported, so I think we should export that all at once. Right now only catch in numbers is exported, but catch biomass is most often what management is based on and is an input to most assessment models.

For fisheries, the plan would be to run `create_fishery_subset` similar to `create_survey` and then `sample_fish` with the same functions as above to get the catch at length and catch at age samples.

Current `create_fishery_subset` function, as yet untested:

```{r createfishery}

knitr::include_graphics("https://dm2301files.storage.live.com/y4mZYTQWQxl_SZg2SQVNtRiv5OjAyv7ML3FqiVW2dY0SZmUlUO91Mm55pNAamQ-Kz7V_p2dH2UrEIGVttXXmiiN4L6HOBJTexTtqu43r9VH4IOdc-XPmsACvcnmzCrUIjqvr6NPJGwpds4_KxLEvlnfGJ3ZobYsxKwIUhObm6Uyq2deLPjZUuMT06YTELfTWbZBcDIR42waSKhxpUKlQXiaIw/CreateFisherySubset.png?psid=1&width=960&height=720")

```

## To Do List:

1. write an aggregation function for resn and structn that applies the median
    + **Done 10 May, `aggregateDensityData`**
2. rewrite `create_survey` to use alternate aggregation functions, or write `create_survey_nums` and `create_survey_wt` functions that use appropriate aggregation
    + the latter may be better, because do we want to apply efficiency and selectivity to median weight densities? I don't think so?
    + the fish still weighs the same, we just don't have the right proportion in the ecosystem (efficiency) or the right proportion at age (selectivty), both of which are taken care of in the numbers.
    + so maybe `aggregateDensityData` using the standard survey polygons gets us what we need as input to `sample_fish`?
    + **Yes, did the latter was done**
3. rewrite `sample_fish` to use alternate aggregation functions for each input type. Keeping this all in one function makes sense because the biological sampling already needs both numbers and weights as inputs
    + **Done 10 May, use sample=FALSE option**