In bioinformatics and biology, we often want to know whether an observed difference between two groups is real or could have occurred by chance.
Examples include:
- gene expression levels in treated vs untreated samples
- GC content in two sets of DNA sequences
- protein lengths in different functional categories
In this project, you will implement a permutation test, a general statistical test that makes few distributional assumptions and estimates significance by simulation.
After completing this project, you should be able to:
- Explain what a statistical hypothesis test is
- Define a null hypothesis and a test statistic
- Implement a permutation-based null model
- Estimate p-values from simulated data
- Interpret statistical results in a biological context
You are given two groups of numerical measurements:
- Group A: measurements from condition A
- Group B: measurements from condition B
Your task is to test whether the observed difference between the two groups is statistically significant, using a permutation test.
You will not use any built-in statistical testing functions. Instead, you will implement the test logic yourself.
The permutation test answers the question:
If there were actually no difference between the two groups, how often would we observe a difference at least as large as the one we see?
To answer this, we:
- Compute a test statistic on the real data
- Randomly shuffle group labels many times
- Recompute the statistic for each shuffle
- Compare the observed statistic to this null distribution
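You will use the difference in means as the test statistic:
$$T = \bar{x}_A - \bar{x}_B$$
where $\bar{x}_A$ and $\bar{x}_B$ are the sample means of Group A and Group B.
Other statistics are possible, but this one is required.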
The null hypothesis states:
The two groups come from the same underlying distribution.
Under this hypothesis, group labels are arbitrary.
A permutation consists of:
- pooling all measurements
- randomly reassigning them into two groups of the original sizes
Your program takes as input:
- Two lists of floating-point numbers: `groupA` and `groupB`
- An integer `N`, the number of permutations (e.g. 1000)
It should output:
- The observed test statistic
- The estimated p-value
- (Optional) the null distribution of test statistics
Implement a function that computes the difference in means between two groups.
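A minimal F# sketch of such a function, assuming both groups are non-empty lists of floats. The name `meanDiff` is illustrative and not prescribed by the project.

```fsharp
// Hypothetical helper name; the project text does not fix one.
// List.average throws on an empty list, so both groups must be non-empty.
let meanDiff (groupA: float list) (groupB: float list) : float =
    List.average groupA - List.average groupB

// Means are 5.075 and 4.25, so the difference is 0.825.
printfn "%.3f" (meanDiff [5.1; 4.9; 5.3; 5.0] [4.2; 4.4; 4.1; 4.3])
```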
Implement a function that:
- pools the data
- randomly permutes it
- splits it back into two groups of the original sizes
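One possible shape for this step is sketched below. It assumes `System.Random` as the random source and uses a Fisher-Yates shuffle on a local array, so the mutation never leaks out of the function. The name `permuteGroups` and the seed value are illustrative choices.

```fsharp
open System

// Hypothetical helper: pools both groups, shuffles the pool, and splits it
// back into two groups of the original sizes.
let permuteGroups (rng: Random) (groupA: float list) (groupB: float list) =
    let pooled = Array.ofList (groupA @ groupB)
    // Fisher-Yates shuffle: walk from the end, swapping each element
    // with a uniformly chosen element at or before it.
    for i in pooled.Length - 1 .. -1 .. 1 do
        let j = rng.Next(i + 1)   // 0 <= j <= i
        let tmp = pooled.[i]
        pooled.[i] <- pooled.[j]
        pooled.[j] <- tmp
    let newA, newB = Array.splitAt (List.length groupA) pooled
    List.ofArray newA, List.ofArray newB

let rng = Random(42)   // fixed seed (arbitrary value) for reproducibility
let permA, permB = permuteGroups rng [5.1; 4.9; 5.3; 5.0] [4.2; 4.4; 4.1; 4.3]
printfn "%A %A" permA permB
```

Passing the `Random` instance in as a parameter, rather than creating a new one per call, keeps the whole run reproducible from a single seed.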
Repeat the permutation step N times and compute the test statistic for each permutation.
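The repetition can be sketched as follows. To keep the block self-contained it uses a compact immutable shuffle (tag each value with a random key, then sort by key) as a stand-in for whatever permutation function you wrote; all names here are illustrative assumptions.

```fsharp
open System

let meanDiff (a: float list) (b: float list) = List.average a - List.average b

// Immutable shuffle: pair each value with a random key and sort by the key.
let shuffle (rng: Random) (xs: float list) =
    xs
    |> List.map (fun x -> (rng.NextDouble(), x))
    |> List.sortBy fst
    |> List.map snd

// Returns the n permuted statistics, i.e. the simulated null distribution.
let nullDistribution (rng: Random) (n: int) (groupA: float list) (groupB: float list) =
    let pooled = groupA @ groupB
    let nA = List.length groupA
    List.init n (fun _ ->
        let permA, permB = List.splitAt nA (shuffle rng pooled)
        meanDiff permA permB)

let nullStats =
    nullDistribution (Random(42)) 1000 [5.1; 4.9; 5.3; 5.0] [4.2; 4.4; 4.1; 4.3]
printfn "simulated %d statistics" nullStats.Length
```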
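Compute the p-value as:
$$p = \frac{b + 1}{N + 1}$$
where $b$ is the number of permutations whose statistic is at least as extreme as the observed one, that is, $|T_{\text{perm}}| \ge |T_{\text{obs}}|$ for a two-sided test.
(This correction avoids zero p-values.)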
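A hedged sketch of this computation, assuming a two-sided test and the common corrected estimate (count + 1) / (N + 1). The name `pValue` and the small tolerance are assumptions added for illustration: without a tolerance, a permutation that reproduces the observed split can miss the comparison by a floating-point rounding error.

```fsharp
// Hypothetical helper: two-sided p-value with the +1 correction.
let pValue (observed: float) (nullStats: float list) : float =
    // Assumed tolerance so that floating-point ties still count as
    // "at least as extreme" as the observed statistic.
    let tol = 1e-12
    let extreme =
        nullStats
        |> List.filter (fun t -> abs t >= abs observed - tol)
        |> List.length
    float (extreme + 1) / float (List.length nullStats + 1)

// Two of the four values (-0.9 and 0.825) are at least as extreme as 0.825,
// so the estimate is (2 + 1) / (4 + 1).
printfn "%.3f" (pValue 0.825 [0.1; -0.9; 0.825; 0.3])   // prints 0.600
```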
Report whether the result is statistically significant for a given significance level (e.g. alpha = 0.05).
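The decision itself is a single comparison. The sketch below assumes the convention that a result is significant when the p-value is less than or equal to alpha; the function name `report` and the p-value passed to it are purely illustrative.

```fsharp
// Hypothetical helper: formats the decision for a given significance level.
// Assumes the convention "significant if p <= alpha".
let report (alpha: float) (p: float) : string =
    if p <= alpha then sprintf "significant at alpha = %g" alpha
    else sprintf "not significant at alpha = %g" alpha

printfn "Result: %s" (report 0.05 0.03)
```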
Input
Group A: [5.1; 4.9; 5.3; 5.0]
Group B: [4.2; 4.4; 4.1; 4.3]
N = 1000
Output
Observed difference in means: 0.825
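Estimated p-value: 0.029
Result: significant at alpha = 0.05
(Note: the p-value will vary slightly between runs due to randomness. For a two-sided test on this data it should be close to 2/70 ≈ 0.029, because only 2 of the 70 possible splits are at least as extreme as the observed one.)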
- Use the random number generation available in F# (for example, .NET's `System.Random`)
- Use immutable data structures unless mutation simplifies the code
- Focus on correctness and clarity
- Make your code reproducible by fixing the random seed
Possible extensions:
- Add a one-sided test
- Allow different test statistics (e.g. median difference)
- Visualize the null distribution (histogram)
- Apply the test to a biological dataset
- Implement Benjamini–Hochberg correction for multiple tests
Submit:
- Source code
- Documentation explaining your approach
- One example dataset with expected interpretation