Note that much more extensive documentation is available in Querying Ensembl.
Ensembl provides direct access to its MySQL databases; alternatively, users can download those databases and run them on a local machine. To run queries against Ensembl’s UK servers, nothing special needs to be done, as this is the default setting for PyCogent’s ensembl module. To use a different Ensembl installation, create an account instance:
>>> from cogent.db.ensembl import HostAccount
>>> account = HostAccount('fastcomputer.topuni.edu', 'username',
... 'canthackthis')
To connect to MySQL on a specific port:
>>> from cogent.db.ensembl import HostAccount
>>> account = HostAccount('fastcomputer.topuni.edu', 'dude',
... 'ucanthackthis', port=3306)
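Hard-coding a password in a script makes it easy to leak. One possible approach, sketched below, reads the connection settings from environment variables and falls back to 3306, the standard MySQL port. The `ENSEMBL_*` variable names and the helper `account_settings` are hypothetical and not part of PyCogent; the resulting values would simply be passed to `HostAccount` as in the examples above.

```python
import os

MYSQL_DEFAULT_PORT = 3306  # standard MySQL port


def account_settings(env=None):
    """Collect connection settings from a mapping of environment variables.

    The ENSEMBL_* variable names are made up for this sketch.
    """
    env = os.environ if env is None else env
    return {
        'host': env['ENSEMBL_HOST'],
        'user': env['ENSEMBL_USER'],
        'passwd': env['ENSEMBL_PASSWD'],
        'port': int(env.get('ENSEMBL_PORT', MYSQL_DEFAULT_PORT)),
    }


# Example with an explicit mapping instead of the real environment.
settings = account_settings({
    'ENSEMBL_HOST': 'fastcomputer.topuni.edu',
    'ENSEMBL_USER': 'dude',
    'ENSEMBL_PASSWD': 'ucanthackthis',
})
# The values could then be used as:
# HostAccount(settings['host'], settings['user'], settings['passwd'],
#             port=settings['port'])
```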
To see which species are currently available:
>>> from cogent.db.ensembl import Species
>>> print Species
================================================================================
   Common Name                     Species Name                Ensembl Db Prefix
--------------------------------------------------------------------------------
     A.aegypti                    Aedes aegypti                    aedes_aegypti
        Alpaca                    Vicugna pacos                 vicugna_pacos...
If Ensembl has added a new species which is not yet included in Species, you can add it yourself.
>>> Species.amendSpecies('A latinname', 'a common name')
You can get the common name for a species:
>>> Species.getCommonName('Procavia capensis')
'Rock hyrax'
You can also get the Ensembl database name prefix, which is used for all databases for this species:
>>> Species.getEnsemblDbPrefix('Procavia capensis')
'procavia_capensis'
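In the examples shown here, the database prefix is the Latin name in lower case with the space replaced by an underscore. The helper below (`latin_to_db_prefix`, a hypothetical name) sketches that pattern for illustration only. It is not how `Species` is implemented, and `Species.getEnsemblDbPrefix` remains the authoritative lookup.

```python
def latin_to_db_prefix(latin_name):
    """Lowercase the Latin name and join its words with underscores."""
    return '_'.join(latin_name.lower().split())


prefix = latin_to_db_prefix('Procavia capensis')
```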
We query the human genome for the BRCA2 gene.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> print human
Genome(Species='Homo sapiens'; Release='58')
>>> genes = human.getGenesMatching(Symbol='BRCA2')
>>> for gene in genes:
...     if gene.Symbol == 'BRCA2':
...         print gene
...         break
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
We use the stable ID for BRCA2.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> gene = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print gene
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
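The stable IDs in this section follow the human Ensembl pattern: `ENS`, a letter for the feature type, then 11 digits. As a sketch under that assumption, the helper below (`stable_id_type`, a hypothetical name) reports which kind of feature an ID refers to before it is passed to a query. IDs from other species carry an additional species code and are not handled here.

```python
import re

# Human stable IDs: 'ENS', a one-letter feature code, then 11 digits.
# Other species insert a species code after 'ENS'; not handled here.
_HUMAN_STABLE_ID = re.compile(r'^ENS([GTEP])(\d{11})$')
_FEATURE_TYPES = {'G': 'gene', 'T': 'transcript', 'E': 'exon', 'P': 'protein'}


def stable_id_type(stable_id):
    """Return the feature type encoded in a human Ensembl stable ID."""
    match = _HUMAN_STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError('not a human Ensembl stable ID: %r' % stable_id)
    return _FEATURE_TYPES[match.group(1)]
```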
We look for breast cancer related genes that are estrogen induced.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> genes = human.getGenesMatching(Description='breast cancer estrogen')
>>> for gene in genes:
...     print gene
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer estrogen-induced...'; StableId='ENSG00000181097'; Status='KNOWN'; Symbol='AC105219.1')
We get the canonical transcript for BRCA2.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> print transcript
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
We get the coding sequence (CDS) of the canonical transcript.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> cds = transcript.Cds
>>> print type(cds)
<class 'cogent.core.sequence.DnaSequence'>
>>> print cds
ATGCCTATTGGATCCAAAGAGAGGCCA...
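A quick sanity check on a CDS is that it begins with the start codon `ATG` and translates cleanly. The sketch below translates the bases printed above with the standard genetic code, using a plain string instead of the `DnaSequence` object. The `translate` helper is written here for illustration and is not PyCogent's own translation method.

```python
BASES = 'TCAG'
AMINO_ACIDS = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
# Standard genetic code: codons enumerated in TCAG order at each position.
CODON_TABLE = dict(
    (a + b + c, AMINO_ACIDS[16 * i + 4 * j + k])
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
)


def translate(dna):
    """Translate the complete codons of an unambiguous DNA string."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return ''.join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


cds_start = 'ATGCCTATTGGATCCAAAGAGAGGCCA'  # bases shown in the output above
protein_start = translate(cds_start)
```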
We look at all transcripts for BRCA2.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> for transcript in brca2.Transcripts:
...     print transcript
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32953976; End=32972409; length=18433; Strand='+')
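In the output above, `length` equals `End - Start` for both transcripts, which is consistent with end-exclusive coordinates of the kind used by Python slices. The sketch below checks that arithmetic and tests whether the shorter transcript lies within the canonical one. `Span` is a stand-in defined here, not a PyCogent class.

```python
from collections import namedtuple

Span = namedtuple('Span', ['start', 'end'])


def span_length(span):
    """Length of an end-exclusive span."""
    return span.end - span.start


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


# Coordinates copied from the two transcripts printed above.
canonical = Span(32889610, 32973347)
shorter = Span(32953976, 32972409)
```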
We show just the first exon of the canonical transcript.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print brca2.CanonicalTranscript.Exons[0]
Exon(StableId=ENSE00001184784, Rank=1)
We show just the introns of the canonical transcript.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> for intron in brca2.CanonicalTranscript.Introns:
...     print intron
Intron(TranscriptId=ENST00000380152, Rank=1)
Intron(TranscriptId=ENST00000380152, Rank=2)
Intron(TranscriptId=ENST00000380152, Rank=3)...
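Introns are the gaps between consecutive exons, so a transcript with n exons has n - 1 introns, ranked in transcript order. A minimal sketch for a plus-strand transcript follows. The exon coordinates are invented for illustration, and `introns_from_exons` is a hypothetical helper rather than anything PyCogent provides.

```python
def introns_from_exons(exons):
    """Return (rank, start, end) for each gap between consecutive exons.

    exons: list of (start, end) pairs, end-exclusive, on the plus strand.
    """
    ordered = sorted(exons)
    introns = []
    for rank, (left, right) in enumerate(zip(ordered, ordered[1:]), 1):
        # An intron runs from the end of one exon to the start of the next.
        introns.append((rank, left[1], right[0]))
    return introns


# Made-up coordinates, not real BRCA2 exons.
exons = [(100, 200), (350, 420), (900, 1000)]
introns = introns_from_exons(exons)
```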
We inspect the genomic coordinates of BRCA2.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print brca2.Location.CoordName
13
>>> print brca2.Location.Start
32889610
>>> print brca2.Location.Strand
1
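Note that `Location.Strand` prints as the integer 1, whereas the transcript display earlier shows `Strand='+'`. If a symbol is needed, a small mapping does the conversion. The sketch assumes the minus strand is represented as -1, which this section does not show.

```python
STRAND_SYMBOLS = {1: '+', -1: '-'}  # -1 for the minus strand is an assumption


def strand_symbol(strand):
    """Convert an integer strand to its '+' or '-' symbol."""
    return STRAND_SYMBOLS[strand]
```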
We query the genome for repeats within a specific coordinate range on chromosome 13.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> repeats = human.getFeatures(CoordName='13', Start=32879610, End=32889610, feature_types='repeat')
>>> for repeat in repeats:
...     print repeat.RepeatClass
...     print repeat
...     break
SINE/Alu
Repeat(CoordName='13'; Start=32879362; End=32879662; length=300; Strand='-', Score=2479.0)
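The repeat shown starts at 32879362, before the query start of 32879610, which suggests that features overlapping the requested range are returned rather than only those fully inside it. The sketch below illustrates such an overlap test for end-exclusive intervals. It is an illustration of the idea, not PyCogent's query logic.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two end-exclusive intervals share at least one position."""
    return start_a < end_b and start_b < end_a


def within(start_a, end_a, start_b, end_b):
    """True if interval a lies entirely inside interval b."""
    return start_b <= start_a and end_a <= end_b


query = (32879610, 32889610)   # range passed to getFeatures above
repeat = (32879362, 32879662)  # repeat printed above
```

To keep only features that lie completely inside the query, filter the results with a test like `within`.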
We query the genome for CpG islands within a specific coordinate range on chromosome 11.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> islands = human.getFeatures(CoordName='11', Start=2150341, End=2170833, feature_types='cpg')
>>> for island in islands:
...     print island
...     break
CpGisland(CoordName='11'; Start=2158951; End=2162484; length=3533; Strand='-', Score=3254.0)
We find the genetic variants for the canonical transcript of BRCA2.
Note: the output is significantly truncated!
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=58, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> print transcript.Variants
(<cogent.db.ensembl.region.Variation object at ...
>>> for variant in transcript.Variants:
...     print variant
...     break
Variation(Symbol='rs55880202'; Effect='5PRIME_UTR'; Alleles='C/T')...
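The `Alleles` value lists the alleles separated by `/`. As an illustration of working with it, the sketch below classifies a biallelic SNP as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion. The function name `snp_class` is hypothetical, and only single-base alleles are handled.

```python
PURINES = frozenset('AG')
PYRIMIDINES = frozenset('CT')


def snp_class(alleles):
    """Classify a biallelic single-base SNP given as e.g. 'C/T'."""
    bases = alleles.upper().split('/')
    valid = len(bases) == 2 and all(b in ('A', 'C', 'G', 'T') for b in bases)
    if not valid or bases[0] == bases[1]:
        raise ValueError('expected two distinct single-base alleles: %r'
                         % alleles)
    pair = frozenset(bases)
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return 'transition'
    return 'transversion'
```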
We get a single SNP and print its allele frequencies.
>>> snp = list(human.getVariation(Symbol='rs34213141'))[0]
>>> print snp.AlleleFreqs
=============================
allele      freq    sample_id
-----------------------------
     A    0.0303          913
     G    0.9697          913
-----------------------------
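The two frequencies in the table sum to 1. As a sketch of what can be derived from them, the code below picks the minor allele and computes the expected heterozygosity, assuming Hardy-Weinberg proportions. The frequencies are copied into a plain dict; reading them from the `AlleleFreqs` table itself is left out, and both helper names are hypothetical.

```python
def minor_allele(freqs):
    """Return (allele, frequency) for the least frequent allele."""
    allele = min(freqs, key=freqs.get)
    return allele, freqs[allele]


def expected_heterozygosity(freqs):
    """1 - sum(p_i ** 2), assuming Hardy-Weinberg proportions."""
    return 1.0 - sum(p * p for p in freqs.values())


freqs = {'A': 0.0303, 'G': 0.9697}  # values from the table above
```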
We create a Compara instance for human, chimpanzee and macaque.
>>> from cogent.db.ensembl import Compara
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=58,
... account=account)
>>> print compara.method_species_links
Align Methods/Clades
===================================================================================================================
method_link_species_set_id  method_link_id  species_set_id      align_method                            align_clade
-------------------------------------------------------------------------------------------------------------------
                       469              10           33006             PECAN           16 amniota vertebrates Pecan
                       467              13           32905               EPO            12 eutherian mammals EPO...
We first get the syntenic region corresponding to human gene BRCA2.
>>> from cogent.db.ensembl import Compara
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=58,
... account=account)
>>> human_brca2 = compara.Human.getGeneByStableId(StableId='ENSG00000139618')
>>> regions = compara.getSyntenicRegions(region=human_brca2, align_method='EPO', align_clade='primates')
>>> for region in regions:
...     print region
SyntenicRegions:
Coordinate(Human,chro...,13,32889610-32962969,1)
Coordinate(Chimp,chro...,13,32082473-32155304,1)
Coordinate(Macaque,chro...,17,11686607-11760932,1)...
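Each line of the display gives species, coordinate type (truncated), coordinate name, start-end and strand. To compare the aligned spans across species, the sketch below parses the printed text and computes `end - start` for each. Parsing display strings is brittle and is done here only because this example has no live connection; `parse_coordinate` is a hypothetical helper.

```python
import re

_COORD = re.compile(r'Coordinate\((\w+),[^,]*,(\w+),(\d+)-(\d+),(-?\d+)\)')


def parse_coordinate(text):
    """Extract species, coord name, start, end and strand from the display."""
    match = _COORD.search(text)
    if match is None:
        raise ValueError('unrecognised coordinate: %r' % text)
    species, coord_name, start, end, strand = match.groups()
    return species, coord_name, int(start), int(end), int(strand)


# Lines copied from the output above.
lines = [
    'Coordinate(Human,chro...,13,32889610-32962969,1)',
    'Coordinate(Chimp,chro...,13,32082473-32155304,1)',
    'Coordinate(Macaque,chro...,17,11686607-11760932,1)',
]
spans = {}
for line in lines:
    species, _, start, end, _ = parse_coordinate(line)
    spans[species] = end - start
```

The three spans are similar in length but not identical, so the alignment of this region has to contain gaps.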
We then get a cogent Alignment object, requesting that sequences be annotated for gene spans.
>>> aln = region.getAlignment(feature_types='gene')
>>> print repr(aln)
3 x 11471 dna alignment: Homo sapiens:chromosome:13:3296...