{Saket Choudhary}

Motif Directionality

| Comments

A motif in a DNA sequence represents a short, recurring pattern that often has biological implications[1]. They often play role as specific binding sites for transcription factors and other proteins.

While investigating motifs and their sites generated using MEME and FIMO[2], I came across a trivial issue. MEME is a tool for motif discovery while FIMO outputs the sites where a particular motif is found.

Examplei(this is a modified form of the original FIMO output):

chrom	start   stop    strand  score   p-value q-value sequence
chr19	8386315	8386328	+	100	3.04e-07	0.0008	TGGTATCGCGAGAC
chr19	8386317	8386330	-	100	3.05e-08	0.006	CCGTCTCGCGATAC	

Motifs are generally described using position weight matrix where each base A,T,G,C takes a value for 0 to 1, depending on how frequently that base occurs at that position.

If you ignore the strand information, the two sites are not in sync, which should be apparent from their start and stop coordinates.

the case of ‘+’ strand is pretty straightforward, but to decipher the start and stop cooridnates for ‘-‘ strand, required me to draw this diagram:

+  5'------TGGTATCGCGAGACGG-------3'
-  3'------ACCATAGCGCTCTGCC-------5'

To elaborate:

0 on the far left of + is position 8386315
0 and the far right of - is position 8386330
The arrows indicate the directionality of motif, yes there is one!


  1. [What are DNA sequence motifs?] (
  2. MEME Suite: Tools for motif discovery and searching


| Comments

GSoC 2015 application period just got over couple of days ago. I received at least 3 requests asking me for my previous proposals. This made me realise, that even though they were always publicly available on Google-melange, the way melange works makes such public proposals almost un-discoverable and probably un-indexable too.

Most of my proposals were written using Google Docs, since it makes life easy for the mentors to track changes and comment. But they are still un-discoverable in this form.

I decided to create a repository porting all my previous proposals to markdown. pandoc has turned out to be really powerful. I didn’t want to just host my own proposals so I tried keeping the format very generic. I sent out couple of emails to past GSoC participants asking them if they would like to push theirs too.

I hope this grows into one powerful archive:

GRCh37 v/s Hg19

| Comments

GRCh37 is supposedly the NCBI equivalent of UCSC’s hg19.

However the forum posts on are not really convincing. There are more insights here


Point #1: Ensembl uses a one-based coordinate system, whereas UCSC uses a zero-based coordinate system.

    Ensembl:  ATCG
    UCSC:     ATCG

For display UCSC browser uses 1-based coordinate system.

Point #2: Ensembl uses the most recently updated human genome housed at the GRC. This current major assembly release is called GRCh38. NCBI and UCSC use the same genome. UCSC refers to the recent human genome as GRCh38/hg38. Before all that let’s see what these coordinate sytems are all about. Blatantly copying from the SAM1 specification[]:

1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats are using the 1-based coordinate system.

0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL formats are using the 0-based coordinate system.]

Let’s take an example here:

    chrX  	|  A  |  T  |  G  |  C  |  A  |  T  |  C  |  G  |
    0-based	0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8
    1-based    1	     2     3     4     5     6     7     8

So expanding from the SAM1 specifications,

  1. Ranged nucleotides will be represented as:

    i) 0-based chrX:2-7 GCATC
    ii) 1-based chrX:3-7 GCATC

  2. From point 1. a single nucleotide is represented as:

    i) 0-based chrX:7-8 G
    ii) 1-based chrX:8-8 G

  3. SNPs will be:

    i) 0-based chrX:7-8 G/A
    ii) 1-based chrX:8-8 G/A

    Insertions and deletions are tricky:

     chrX        |  A  |  T  |  G  |  C  |  A  |  T  |  C  |  C  |  ATC  G  |
     0-based     0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8       |  9
     1-based        1     2     3     4     5     6     7     8          9

ATC is inserted in between C & G of the original sequence.

  1. INSertions:

    i) 0-based chrX:8-8 -/ATC
    ii) 1-based chrX:8-9 -/ATC

  2. DELetion:

    i) 0-based chrX:4-5 A/-
    ii) 1-based chrX:5-5 A/-

Example [,_half-open_mean?]:

$ cat first_base.bed
chr1    0       1       i-am-the-first-position-on-chrom1

Given the examples the conversion is straightforward:

For nucleotides, SNP, DEL:
        (1-based start) = (0-based start + 1)
        (1-based end)   = (0-based end)
For insertion:
        (1-based start) = (0-based start)
        (1-based end)   = (0-based end + 1)

Why a 0-based system? It makes the calculations simple: Consider chrX0-2 ATC & chrX1-4 ATCG

Length of overlap = min(end1, end2) - max(start1, start2) = 2-0 = 2 Whereas for a 1-based system:
chrX1-3 ATC & chrX2-5 ATCG
Length of overlap = min(end1, end2) - max(start1, start2) = 3-2 = 1 => WRONG! Why? You need to add 1

It is intuitive if you think of the numbers between(including) 1 & 5 is 5-1+1=5! [1,2,3,4,5]

An important note for uploading data to UCSC Genome browser: > If you submit data to the browser in position format (chr#:##-##), the browser assumes this information is 1-based. If you submit data in any other format (BED (chr# ## ##) or otherwise), the browser will assume it is 0-based.You can see this both in our liftOver utility and in our search bar, by entering the same numbers in position or BED format and observing the results. Similarly, any data returned by the browser in position format is 1-based, while data returned in BED format is 0-based.

Coming back to hg19 and GRCh37 stuff,

The GRCh37 assembly defintion is at:

Besides Chromosomes 1-22, there are these things called ‘unlocalized-scaffold’ and ‘unplaced-scaffold’.

Let’s head over to the definitions of the Consortium:

A sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.
e.g HSCHR17_RANDOM_CTG2 associated with chromosome 17, but cannot be oriented on it

A sequence found in an assembly that is not associated with any chromosome.

hg19 equivalent definitions, and corresponding data

In the past, these tables contained data related to sequence that is known to be in a particular chromosome, but could not be reliably ordered within the current sequence.
E.g. chr21_gl000210_random.fa.gz

ChrUn: ChrUn contains clone contigs that can’t be confidently placed on a specific chromosome. E.g chrUn_gl000237

Another note:

UCSC’s hg19 has chrM as the mitochondiral DNA. > Note on chrM: Since the release of the UCSC hg19 assembly, the Homo sapiens mitochondrion sequence (represented as “chrM” in the Genome Browser) has been replaced in GenBank with the record NC_012920. We have not replaced the original sequence, NC_001807, in the hg19 Genome Browser. We plan to use the Revised Cambridge Reference Sequence (rCRS, in the next human assembly release

In order to verify that they indeed have atleast the same chromosome lenths I hacked this simple script.

Library Size, Fragment Size, Insert Size

| Comments

Bioinformatics has always been plagued with terms with multi-definitions for the same term. The FastQ format for example remained undocumented until Peter Cock et al. published The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

I will try to cover a detailed explanation of the following terms , which I have often found to be confusing, thanks to the lack of cohesiveness in Bioinformatics:

  • library size
  • fragment size
  • insert size

SAMv1 specification defines a ‘TLEN’ term in the alignment section: > TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.

Going by this definition the ‘insert size’ should mean: The total length of the original piece being sequenced(without adapters!). Let’s pick an example from Seqanswers, slightly modified here. Reads R_1 and R_2 are paired end reads, sequencing the same sequence from two opposite ends. We want to sequence the ‘Piece to be Sequenced’.

Piece to be Sequenced:                       |=====50=====|==========150==========|=====50=====|
Fragment=Piece + Sequencing Adapters: |~~20~~|=====50=====|==========150==========|=====50=====|~~20~~|    
Reads(Paired End):                           |----------->|                       |<-----------|    
                                                  R_1                                  R_2
Inner Mate Distance:                                      |==========150==========|         

Sequencing chops the DNA into shorter pieces, and these pieces are ligated with adapter sequences.Though we started with Piece to be sequenced it was ligated with sequencing adapters on both sides.
The original sequence that was supposed to be sequenced was (50+150+50=) 250 bp long. However while the DNA was cut, the ligation step added extra bases at both ends of this piece so that it is now (20+50+150+50+20=) 290 bp long and this ‘piece+ligated sequence adapters’ are called ‘Fragment’(This is the reason I chose the word ‘piece’ to denote the part of DNA to be sequenced). Generally sequencing begins after adapter sequences. This is the reason R_1 and R_2 do not cover the 20 bp adapter sequences. In any case I assume the reads have been trimmed to remove the adapter sequences. The 150 bp sequence that is left unsequenced here is the distance between mate pairs, or the Inner Mate Distance.

Summarizing with respect to the above example:

    Insert Size      = 50+150+50        = 250
    Fragment Size    = 20+50+150+50+20  = 290 
    Library Size     = 20+50+150+50+20  = 290
    Inner Mate Distance                 = 150            


    Insert           = Sequence without the adapters(The original DNA chunk)
    Fragment         = Insert Size + Right/Left flanking adapters
    Inner Mate       = Gap between the mate pair reads

Reference: Seqanswers