A motif in a DNA sequence represents a short, recurring pattern that often has biological
implications. They often play role as specific binding sites for transcription factors and other
While investigating motifs and their sites generated using MEME and FIMO, I came across a trivial issue.
MEME is a tool for motif discovery while FIMO outputs the sites where a particular motif is found.
Examplei(this is a modified form of the original FIMO output):
GSoC 2015 application
period just got over couple of days ago. I received at least 3 requests asking
me for my previous proposals. This made me realise, that even though
they were always publicly available on Google-melange, the way melange works
makes such public proposals almost un-discoverable and probably un-indexable
Most of my proposals were written using Google Docs, since it makes life
easy for the mentors to track changes and comment. But they are still un-discoverable
in this form.
I decided to create a repository porting all my previous proposals to markdown.
pandoc has turned out to be really powerful. I didn’t want to just host my
own proposals so I tried keeping the format very generic. I sent out couple
of emails to past GSoC participants asking them if they would like to push theirs too.
Point #2: Ensembl uses the most recently updated human genome housed at the GRC.
This current major assembly release is called GRCh38. NCBI and UCSC use the same genome.
UCSC refers to the recent human genome as GRCh38/hg38. Before all that let’s see what these coordinate sytems
are all about. Blatantly copying from the SAM1 specification[http://samtools.github.io/hts-specs/SAMv1.pdf]:
1-based coordinate system: A coordinate system where the first base of a sequence is one. In this
coordinate system, a region is specified by a closed interval. For example, the region between
the 3rd and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats are using
the 1-based coordinate system.
0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this
coordinate system, a region is specified by a half-closed-half-open interval. For example, the
region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL
formats are using the 0-based coordinate system.]
Let’s take an example here:
chrX | A | T | G | C | A | T | C | G |
0-based 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
1-based 1 2 3 4 5 6 7 8
Given the examples the conversion is straightforward:
For nucleotides, SNP, DEL:
(1-based start) = (0-based start + 1)
(1-based end) = (0-based end)
(1-based start) = (0-based start)
(1-based end) = (0-based end + 1)
Why a 0-based system?
It makes the calculations simple:
Consider chrX0-2 ATC & chrX1-4 ATCG
Length of overlap = min(end1, end2) - max(start1, start2) = 2-0 = 2 Whereas for a 1-based system:
chrX1-3 ATC & chrX2-5 ATCG
Length of overlap = min(end1, end2) - max(start1, start2) = 3-2 = 1 => WRONG! Why? You need to add 1
It is intuitive if you think of the numbers between(including) 1 & 5 is 5-1+1=5! [1,2,3,4,5]
An important note for uploading data to UCSC Genome browser:
> If you submit data to the browser in position format (chr#:##-##), the browser assumes this information is 1-based. If you submit data in any other format (BED (chr# ## ##) or otherwise), the browser will assume it is 0-based.You can see this both in our liftOver utility and in our search bar, by entering the same numbers in position or BED format and observing the results. Similarly, any data returned by the browser in position format is 1-based, while data returned in BED format is 0-based.
Coming back to hg19 and GRCh37 stuff,
The GRCh37 assembly defintion is at: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.13.assembly.txt
Besides Chromosomes 1-22, there are these things called ‘unlocalized-scaffold’ and ‘unplaced-scaffold’.
Let’s head over to the definitions of the Consortium: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml
A sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.
e.g HSCHR17_RANDOM_CTG2 associated with chromosome 17, but cannot be oriented on it
A sequence found in an assembly that is not associated with any chromosome.
In the past, these tables contained data related to sequence that is known to be in a particular chromosome, but could not be reliably ordered within the current sequence.
ChrUn contains clone contigs that can’t be confidently placed on a specific chromosome.
UCSC’s hg19 has chrM as the mitochondiral DNA.
> Note on chrM: Since the release of the UCSC hg19 assembly, the Homo sapiens mitochondrion sequence (represented as “chrM” in the Genome Browser) has been replaced in GenBank with the record NC_012920. We have not replaced the original sequence, NC_001807, in the hg19 Genome Browser. We plan to use the Revised Cambridge Reference Sequence (rCRS, http://mitomap.org/bin/view.pl/MITOMAP/HumanMitoSeq) in the next human assembly release
In order to verify that they indeed have atleast the same chromosome lenths I hacked this simple script.
I will try to cover a detailed explanation of the following terms , which I have often found to be confusing, thanks to the lack of cohesiveness in Bioinformatics:
SAMv1 specification defines a ‘TLEN’ term in the alignment section:
> TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.
Going by this definition the ‘insert size’ should mean: The total length of the original piece being sequenced(without adapters!).
Let’s pick an example from Seqanswers, slightly modified here. Reads R_1 and R_2 are paired end reads, sequencing the same sequence from two opposite ends. We want to sequence the ‘Piece to be Sequenced’.
Piece to be Sequenced: |=====50=====|==========150==========|=====50=====|
Fragment=Piece + Sequencing Adapters: |~~20~~|=====50=====|==========150==========|=====50=====|~~20~~|
Reads(Paired End): |----------->| |<-----------|
Inner Mate Distance: |==========150==========|
Sequencing chops the DNA into shorter pieces, and these pieces are ligated with adapter sequences.Though we started with Piece to be sequenced it was ligated with sequencing adapters on both sides.
The original sequence that was supposed to be sequenced was (50+150+50=) 250 bp long. However while the DNA was cut, the ligation step added extra bases at both ends of this piece so that it is now (20+50+150+50+20=) 290 bp long and this ‘piece+ligated sequence adapters’ are called ‘Fragment’(This is the reason I chose the word ‘piece’ to denote the part of DNA to be sequenced). Generally sequencing begins after adapter sequences. This is the reason R_1 and R_2 do not cover the 20 bp adapter sequences. In any case I assume the reads have been trimmed to remove the adapter sequences. The 150 bp sequence that is left unsequenced here is the distance between mate pairs, or the Inner Mate Distance.