Wednesday, March 14, 2012

Basic Primer on Population DNA Genetics

This is a basic primer meant to help readers understand a few blog posts on Sri Lankan population genetics that I plan to write in the near future. Click here for the latest list of Sri Lankan Population Genetics posts.

The genome is the entirety of an organism's hereditary information. It is encoded either in DNA or, for many types of virus, in RNA. The genome includes both the genes and the non-coding sequences of the DNA/RNA. The haploid human genome (23 chromosomes) is estimated to be about 3.2 billion base pairs long and has about 20,000–25,000 distinct genes.

A gene is the molecular unit of heredity of a living organism: a stretch of DNA (or RNA). A modern working definition of a gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions". Each gene has a specific location (locus) on a chromosome and may come in several forms (alleles). In simpler language, the DNA that gives brown hair and the DNA that gives black hair are both alleles of the hair color gene (more info here). A more detailed non-technical introduction to genomes and genes is Introduction to genetics.

Genotype vs. Phenotype:
This is a very important concept. Even if the genes (genotype) are identical, the outward expression or looks (phenotype) can be different. One example is identical twins, who have the same genes but still show differences, including different fingerprints. Another example is the children of short parents, who carry the genes for short stature but could still grow taller because of better nutrition.

The opposite is also true in that just because outward appearance is similar (phenotype) the genes (genotype) do not have to be similar. Example: Africans and Papua New Guineans though superficially similar are about the furthest apart genetically.
(Photos: African; Papua New Guinean)

What Kind of Genetic Tests: (see here for more non technical info)
Humans have 22 pairs of autosomal (non-sex) chromosomes. The 23rd pair is the sex chromosomes: XX in the case of a female, XY in the case of a male. These are the tests currently available.
  • Y-DNA: This test is only for males (it tests the Y chromosome) and gives your direct male ancestry, i.e. the genes that were passed down from your father's father's father, etc. Currently about 67 markers are tested.
  • mt-DNA: Males and females can be tested. This gives the genes in the mitochondria that were passed down from your mother's mother's mother, etc.
  • Autosomal: Males and females can be tested. This tests the 22 pairs of autosomal (non-sex) chromosomes. Currently about 0.024% of the roughly 3 billion base pairs are tested.
Assume there is a genome that is pure Sri Lankan and one that is pure European. Three generations ago, the maternal great-grandmother in the direct female line (DF = Direct Female) and the paternal great-grandfather in the direct male line (DM = Direct Male) were pure Sri Lankan. The daughters on the maternal side and the sons on the paternal side always married pure Europeans.
When the person gets tested, the Y-DNA and mt-DNA tests will show that they are pure Sri Lankan. However, the autosomal test will show about 25% Sri Lankan and 75% European. That is because three generations back there were 8 great-grandparents (2^(number of generations) = 2^3 = 8): 2 were Sri Lankan (2/8 = 25%) and 6 were European (6/8 = 75%).
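The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the expected average share; the function names are my own, and real autosomal inheritance varies around these averages because of recombination.

```python
from fractions import Fraction


def ancestors_at(generations: int) -> int:
    """Number of ancestor slots a given number of generations back (2**n)."""
    return 2 ** generations


def expected_autosomal_share(ancestors_from_group: int, generations: int) -> Fraction:
    """Expected (average) autosomal share contributed by one group of ancestors."""
    return Fraction(ancestors_from_group, ancestors_at(generations))


# The example from the text: 3 generations back, 2 Sri Lankan and 6 European
# great-grandparents.
great_grandparents = ancestors_at(3)
sri_lankan = expected_autosomal_share(2, 3)
european = expected_autosomal_share(6, 3)

print(great_grandparents)   # 8
print(float(sri_lankan))    # 0.25
print(float(european))      # 0.75
```

The Y-DNA and mt-DNA results do not follow this arithmetic at all, because each of them traces a single line of ancestors rather than all eight.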

Testing and Results
I have almost no clue as to the steps between sending your saliva and getting your genome data. There is DNA amplification, which makes more copies of the same DNA from the small sample sent. Then the sample is probably analyzed on machines such as Illumina's genotyping chips, which I loosely think of as something like a 10th-generation HPLC (High Pressure Liquid Chromatograph).
Depending on the kind of machine (chip), different parts of the genome (~0.024% of the 3.2 billion base pairs) get tested. That means that when results from different machines need to be compared and analyzed, the commonly tested locations need to be extracted first. Say, for example, you got your autosomal test done at FTDNA and you are submitting the results to a research group that has mainly 23andMe results. Then the researcher will have to extract the data common to both FTDNA and 23andMe before any analysis can be done.
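As a rough sketch of what "extracting the common locations" means, assume each company's results have already been loaded into a dictionary keyed by rsid. The variable names and the toy genotypes below are made up for illustration; they are not real results or either company's actual file format.

```python
from typing import Dict, List


def common_snps(panel_a: Dict[str, str], panel_b: Dict[str, str]) -> List[str]:
    """Return the rsids present in both panels, sorted for a stable order."""
    return sorted(panel_a.keys() & panel_b.keys())


# Hypothetical toy data: rsid -> genotype, one dictionary per testing company.
ftdna_like = {"rs3094315": "AG", "rs12562034": "AG", "rs3934834": "CC"}
me23_like = {"rs3094315": "AA", "rs3934834": "CT", "rs9442372": "AG"}

shared = common_snps(ftdna_like, me23_like)
print(shared)  # ['rs3094315', 'rs3934834']
```

Only the shared rsids would then be carried forward into any comparison, which is why combining data from different chips usually shrinks the number of usable SNPs.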

Anyway, once the analysis is done you will get a whole lot of results, ranging from health to ancestry. Other than that, you also get a data file of approximately 5 MB in text format.
What Can be done with the Raw DNA data
a) There is SNPTips, a free Firefox extension that will automatically match your genotype SNPs with others.
b) You could analyse it at http://snpinfo.niehs.nih.gov/snpfunc.htm. Take an rsid from your raw data file, paste it into SNP Function Prediction or SNP Information in DNA Sequence, and see the results. For SNP Function Prediction you need to tick some of the boxes, such as "Based on Genotype Data from dbSNP", and pick a population, say Asian. I have no clue as to what the results mean; it's going to be a learning curve.
d) Do more research yourself or participate (anonymously if you wish) in one of the many projects, such as HarappaDNA (for South Asian analysis), the Dodecad Ancestry Project and the Eurogenes Ancestry Project.

Data and Analysis
This section gives a general outline of the data preparation and analysis of the raw autosomal data.
The 5 MB raw autosomal data file (from 23andMe) will look like the sample below.
rsid       chromosome position genotype
rs3094315  1          742429    AG
rs12562034 1          758311    AG
rs3934834  1          995669    CC
rs9442372  1          1008567   AG
rs3737728  1          1011278   AG
rsid or SNP: The identifier of the SNP (single nucleotide polymorphism) tested. Typically only about 0.024% of the roughly 3 billion base pairs are tested (still hundreds of thousands of SNPs), at locations (positions) known for genetic diversity.
chromosome: The chromosome number, 1 to 22 for the autosomal (non-sex) chromosomes.
position: Position of the place tested on the chromosome (also called locus or marker).
genotype: The pair of bases (alleles) at that position, one from each of the two copies of the chromosome (one inherited from each parent). Each will be one of the four bases that make up DNA: A (adenine), C (cytosine), G (guanine), and T (thymine).
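To make the four columns concrete, here is a minimal sketch of reading such a file in Python. It assumes whitespace-separated columns, an optional header line starting with "rsid", and comment lines starting with "#"; the names `Snp` and `parse_raw` are my own and not part of any official tool.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    rsid: str
    chromosome: str   # kept as text so labels such as X, Y or MT would also fit
    position: int
    genotype: str


def parse_raw(text: str) -> List[Snp]:
    """Parse raw-data text into Snp records, skipping headers and comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("rsid"):
            continue
        rsid, chromosome, position, genotype = line.split()
        records.append(Snp(rsid, chromosome, int(position), genotype))
    return records


# The sample rows shown above.
SAMPLE = """\
rsid       chromosome position genotype
rs3094315  1          742429    AG
rs12562034 1          758311    AG
rs3934834  1          995669    CC
rs9442372  1          1008567   AG
rs3737728  1          1011278   AG
"""

records = parse_raw(SAMPLE)
# Heterozygous = the two alleles differ (e.g. AG); homozygous = the same (e.g. CC).
heterozygous = [r.rsid for r in records
                if len(r.genotype) == 2 and r.genotype[0] != r.genotype[1]]

print(len(records))   # 5
print(heterozygous)   # ['rs3094315', 'rs12562034', 'rs9442372', 'rs3737728']
```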

The general steps for analysis and comparison of genetic affinities are outlined below. (See here for an in-depth description of using ADMIXTURE at Razib Khan's Gene Expression, and Anderson et al (2010) Data Control.. (complete pdf).)
Note: ADMIXTURE (in capitals) is a software program, while admixture is the process of mixing of genes.

  1. Your raw data is combined with thousands of other genomes, some freely available from studies and others handed over by people who have had their genomes tested.
  2. Software like PLINK is used to do genome association analysis. Additionally, it is used to create standard file formats that are used as inputs to other genome analysis software such as ADMIXTURE.
  3. ADMIXTURE's input is binary PLINK (.bed) or ordinary PLINK (.ped and .map). The plink .ped file contains the genotype information (which SNP variants are where) and the .map file is essentially a list of the SNP names.
  4. To use ADMIXTURE, you need an idea of K, your belief of the number of ancestral populations.
  5. You can run ADMIXTURE in regular mode or supervised mode. Supervised mode essentially anchors the output to some reference populations. The reference population can either be autosomal data of real individuals or zombies. Zombies are recreated data that stand in for a hypothetical genetically pure individual or population (genomes created from allele frequencies using the --simulate option of PLINK).
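Steps 2 to 5 can be sketched as the command lines one would hand to the two programs. The snippet below only builds and prints the commands; it does not run them. The file prefixes (merged, merged_bin) and K = 3 are made-up examples, and the flags are used as I understand them from the PLINK and ADMIXTURE documentation, so check the manuals before relying on them. My understanding is also that supervised mode expects a .pop file of population labels next to the .bed file.

```python
import shlex
from typing import List


def plink_make_bed(prefix: str, out_prefix: str) -> List[str]:
    """Command to convert prefix.ped/prefix.map into binary PLINK files."""
    return ["plink", "--file", prefix, "--make-bed", "--out", out_prefix]


def admixture_cmd(bed_file: str, k: int, supervised: bool = False) -> List[str]:
    """Command to run ADMIXTURE on a .bed file with K ancestral populations."""
    cmd = ["admixture"]
    if supervised:
        cmd.append("--supervised")  # anchor output to labelled reference samples
    cmd += [bed_file, str(k)]
    return cmd


# Hypothetical file names: "merged" is your data already combined with references.
steps = [
    plink_make_bed("merged", "merged_bin"),
    admixture_cmd("merged_bin.bed", 3),
    admixture_cmd("merged_bin.bed", 3, supervised=True),
]
for step in steps:
    print(shlex.join(step))
```

Keeping the commands as lists makes it easy to pass them to `subprocess.run` later, once the programs are installed and the input files exist.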
Important Caveats
  • Admixture analysis cannot distinguish between recent and ancient gene flow, or determine the direction of the flow.
  • It is important to recognize that the regions of highest haplogroup frequency are not necessarily representative of origin. An obvious example is haplogroup C, which displays its highest frequency in Polynesia (Kayser et al. 2000), but Polynesia is one of the last regions known to be colonized by modern humans (Sengupta et al. 2005).
  • Linguistic and cultural groups (possible proxies for "Race") may not be the same as the genetic groupings.
