Project: Statistical Hypothesis Testing Using Permutation Tests

Background

In bioinformatics and biology, we often want to know whether an observed difference between two groups is real or could have occurred by chance.

Examples include:

  • gene expression levels in treated vs untreated samples
  • GC content in two sets of DNA sequences
  • protein lengths in different functional categories

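In this project, you will implement a permutation test, a general statistical test that estimates significance by simulation. It makes no assumption about the shape of the underlying distribution; it requires only that the observations be exchangeable under the null hypothesis.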


Learning Objectives

After completing this project, you should be able to:

  • Explain what a statistical hypothesis test is
  • Define a null hypothesis and a test statistic
  • Implement a permutation-based null model
  • Estimate p-values from simulated data
  • Interpret statistical results in a biological context

Problem Description

You are given two groups of numerical measurements:

  • Group A: measurements from condition A
  • Group B: measurements from condition B

Your task is to test whether the observed difference between the two groups is statistically significant, using a permutation test.

You will not use any built-in statistical testing functions. Instead, you will implement the test logic yourself.


Key Idea

The permutation test answers the question:

If there were actually no difference between the two groups, how often would we observe a difference at least as large as the one we see?

To answer this, we:

  1. Compute a test statistic on the real data
  2. Randomly shuffle group labels many times
  3. Recompute the statistic for each shuffle
  4. Compare the observed statistic to this null distribution

Definitions

Test Statistic

You will use the difference in means as the test statistic:

$$ T = \bar{x}_A - \bar{x}_B $$

Other statistics are possible, but this one is required.
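For reference, the two means can be written out explicitly. The symbols $n_A$ and $n_B$ (group sizes) and $x_{A,i}$, $x_{B,j}$ (individual measurements) are notation assumed here, not taken from the text above:

$$ \bar{x}_A = \frac{1}{n_A} \sum_{i=1}^{n_A} x_{A,i}, \qquad \bar{x}_B = \frac{1}{n_B} \sum_{j=1}^{n_B} x_{B,j} $$

With this sign convention, a positive $T$ means that group A has the larger mean.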


Null Hypothesis

The null hypothesis states:

The two groups come from the same underlying distribution.

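Under this hypothesis, group labels are arbitrary: every reassignment of the labels to the pooled measurements is equally likely, which is what justifies shuffling them.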


Permutation

A permutation consists of:

  • pooling all measurements
  • randomly reassigning them into two groups of the original sizes

Input

  • Two lists of floating-point numbers:

    • groupA
    • groupB
  • An integer N, the number of permutations (e.g. 1000)


Output

  • The observed test statistic
  • The estimated p-value
  • (Optional) the null distribution of test statistics

Starting Tasks

Task 1: Test Statistic

Implement a function that computes the difference in means between two groups.
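One possible shape for this function is sketched below. The name `diffInMeans` and the choice of `float list` inputs are assumptions made for illustration; the sketch also assumes both groups are non-empty.

```fsharp
/// Difference in means, T = mean(A) - mean(B).
/// Assumes both lists are non-empty (List.average throws on an empty list).
let diffInMeans (groupA: float list) (groupB: float list) : float =
    List.average groupA - List.average groupB

// Usage: the mean of [1; 2; 3] is 2 and the mean of [2; 4] is 3, so T = -1.
let t = diffInMeans [1.0; 2.0; 3.0] [2.0; 4.0]
printfn "T = %g" t
```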


Task 2: Permutation Generation

Implement a function that:

  • pools the data
  • randomly permutes it
  • splits it back into two groups of the original sizes
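The three steps above could be sketched as follows. This is one possible approach, not a required design: the name `permuteGroups` is hypothetical, and the shuffle is a Fisher-Yates shuffle driven by `System.Random`, applied to a local array so that no mutation is visible to the caller.

```fsharp
open System

/// Pool both groups, shuffle the pooled values, and split them back into
/// two groups with the original sizes.
let permuteGroups (rng: Random) (groupA: float list) (groupB: float list) : float list * float list =
    // Array.ofList builds a fresh array, so the inputs are never modified.
    let pooled = Array.ofList (groupA @ groupB)
    // Fisher-Yates shuffle: swap position i with a uniformly chosen position in 0..i.
    for i = pooled.Length - 1 downto 1 do
        let j = rng.Next(i + 1)
        let tmp = pooled.[i]
        pooled.[i] <- pooled.[j]
        pooled.[j] <- tmp
    // The first |A| shuffled values become the new group A, the rest group B.
    let permA, permB = Array.splitAt (List.length groupA) pooled
    List.ofArray permA, List.ofArray permB

// Usage with an arbitrary seed; the printed groups depend on the seed.
let rng = Random(42)
let permA, permB = permuteGroups rng [1.0; 2.0; 3.0] [4.0; 5.0]
printfn "%A %A" permA permB
```

Passing the generator in as a parameter, rather than creating it inside the function, lets the caller control the seed.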

Task 3: Null Distribution

Repeat the permutation step N times and compute the test statistic for each permutation.
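A minimal sketch of this loop is shown below, written so that it runs on its own. The names `shuffle` and `nullDistribution` are hypothetical, the statistic is the difference in means, and both groups are assumed to be non-empty.

```fsharp
open System

/// Return a shuffled copy of the input array (Fisher-Yates).
let shuffle (rng: Random) (xs: float[]) : float[] =
    let arr = Array.copy xs
    for i = arr.Length - 1 downto 1 do
        let j = rng.Next(i + 1)
        let tmp = arr.[i]
        arr.[i] <- arr.[j]
        arr.[j] <- tmp
    arr

/// Build n permuted statistics: shuffle the pooled data, split it at |A|,
/// and recompute the difference in means each time.
let nullDistribution (rng: Random) (n: int) (groupA: float list) (groupB: float list) : float list =
    let pooled = Array.ofList (groupA @ groupB)
    let nA = List.length groupA
    List.init n (fun _ ->
        let permA, permB = Array.splitAt nA (shuffle rng pooled)
        Array.average permA - Array.average permB)

// Usage with an arbitrary seed and a small N.
let nullDist = nullDistribution (Random(42)) 1000 [1.0; 2.0; 3.0] [4.0; 5.0]
printfn "generated %d permuted statistics" (List.length nullDist)
```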


Task 4: P-value Estimation

Compute the p-value as:

$$ p = \frac{\#\{\, |T_{\mathrm{perm}}| \ge |T_{\mathrm{obs}}| \,\} + 1}{N + 1} $$

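(Here the numerator counts the permutations whose statistic is at least as extreme as the observed one. Adding 1 to both the numerator and the denominator treats the observed labelling as one more permutation, so the estimated p-value is never exactly zero.)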
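The formula can be sketched as a function of the observed statistic and the list of permuted statistics. The name `pValue` and the list-based signature are assumptions made for illustration.

```fsharp
/// Two-sided permutation p-value with the +1 correction:
/// (number of |T_perm| >= |T_obs|, plus 1) divided by (N + 1).
let pValue (observed: float) (nullDist: float list) : float =
    let extreme =
        nullDist
        |> List.filter (fun t -> abs t >= abs observed)
        |> List.length
    float (extreme + 1) / float (List.length nullDist + 1)

// Usage with a hand-made null distribution (illustrative values only):
// three of the five values are at least as extreme as 0.8 in absolute value,
// so p = (3 + 1) / (5 + 1).
let p = pValue 0.8 [0.1; -0.9; 0.8; 0.3; 1.2]
printfn "p = %.4f" p
```

The comparison here is exact. With real data, summing the same values in a different order can change the result by a rounding error, which may affect ties; comparing with a small tolerance is one way to handle this, although it is not part of the formula above.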


Task 5: Interpretation

Report whether the result is statistically significant for a given significance level (e.g. $\alpha = 0.05$).
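A small reporting helper might look like the sketch below. The name `interpret` is hypothetical, and the decision rule $p \le \alpha$ is an assumption: some texts use a strict inequality instead.

```fsharp
/// Describe whether a p-value is significant at level alpha.
/// Assumption: "significant" means p <= alpha.
let interpret (alpha: float) (p: float) : string =
    if p <= alpha then sprintf "significant at alpha = %g" alpha
    else sprintf "not significant at alpha = %g" alpha

// Usage with illustrative values.
printfn "Result: %s" (interpret 0.05 0.03)
```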


Example

Input

Group A: [5.1; 4.9; 5.3; 5.0]
Group B: [4.2; 4.4; 4.1; 4.3]
N = 1000

Output

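Observed difference in means: 0.825
Estimated p-value: 0.030
Result: significant at alpha = 0.05

(Note: the observed difference is deterministic, since the group means are 5.075 and 4.25. Only the estimated p-value varies from run to run. For this data it should be close to the exact two-sided value 2/70 ≈ 0.029, because only 2 of the 70 possible splits are as extreme as the observed one.)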


Implementation Notes

  • Use the random number generation available in F# through .NET (e.g. System.Random)
  • Use immutable data structures unless mutation simplifies the code
  • Focus on correctness and clarity
  • Make your code reproducible by fixing the random seed
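As an illustration of the last point, the sketch below shows that two `System.Random` generators created with the same seed produce the same sequence on a given runtime. The helper name `draws` and the seed value 42 are arbitrary choices made for this example.

```fsharp
open System

/// Draw five integers in 0..99 from a generator created with the given seed.
let draws (seed: int) : int list =
    let rng = Random(seed)
    List.init 5 (fun _ -> rng.Next(100))

// Two runs with the same seed yield identical draws, so a seeded
// permutation test can be repeated exactly.
printfn "same sequence: %b" (draws 42 = draws 42)
```

Creating one seeded generator and passing it to every function that needs randomness keeps the whole analysis reproducible from a single seed.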

Extension Tasks

  • Add a one-sided test
  • Allow different test statistics (e.g. the difference in medians)
  • Visualize the null distribution (histogram)
  • Apply the test to a biological dataset
  • Implement Benjamini–Hochberg correction for multiple tests
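For the last extension, one common formulation of the Benjamini–Hochberg procedure computes adjusted p-values: sort the p-values, scale the k-th smallest by m / k (where m is the number of tests), then enforce monotonicity from the largest downward. The sketch below follows that formulation; the name `benjaminiHochberg` and the list-based interface are assumptions.

```fsharp
/// Benjamini–Hochberg adjusted p-values, returned in the original order.
let benjaminiHochberg (pValues: float list) : float list =
    let m = float (List.length pValues)
    // Sort ascending, remembering each p-value's original position.
    let ranked = pValues |> List.mapi (fun i p -> i, p) |> List.sortBy snd
    // Scale the k-th smallest p-value (k starting at 1) by m / k.
    let scaled = ranked |> List.mapi (fun k (i, p) -> i, p * m / float (k + 1))
    // Walk from the largest rank down, keeping a running minimum capped at 1.
    let adjusted, _ =
        List.foldBack
            (fun (i, q) (acc, runningMin) ->
                let q' = min q runningMin
                (i, q') :: acc, q')
            scaled
            ([], 1.0)
    // Restore the original order of the tests.
    adjusted |> List.sortBy fst |> List.map snd

// Usage with illustrative p-values from four hypothetical tests.
printfn "%A" (benjaminiHochberg [0.01; 0.04; 0.03; 0.005])
```

A test is then called significant at false discovery rate $\alpha$ when its adjusted p-value is at most $\alpha$.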

Submission

Submit:

  • Source code
  • Documentation explaining your approach
  • One example dataset with its expected interpretation