JASPAR Documentation

JASPAR is a collection of transcription factor DNA-binding preferences, modeled as matrices. These can be converted into Position Weight Matrices (PWMs or PSSMs), used for scanning genomic sequences.

JASPAR is the only database with this scope where the data can be used with no restrictions (open-source). For a comprehensive review of models and how they can be used, please see the following reviews:

Modeling the specificity of protein-DNA interactions Quant Biol. 2013 Jun;1(2):115-130

Applied bioinformatics for the identification of regulatory elements Nat Rev Genet. 2004 Apr;5(4):276-87

 

 

1 JASPAR Collections

The JASPAR database consists of smaller subsets of profiles known as collections. Each of these collections has a different goal, as described below. The main collection is known as JASPAR CORE and is the collection most scientists use.

 

1.1 Version control

Since JASPAR 4, all matrix models have versions. This is primarily to keep track of improvements, which can be anything from correcting typos to building a new model from new data. Version control works as follows: IDs consist of a stable ID and a version number, so that the whole ID is [stable ID].[version]. The stable ID follows a certain transcription factor, or another logical unit such as a dimer pair. For instance, the stable ID for the factor GATA1 is MA0035. However, the GATA1 matrix has been updated twice with new data, so there are currently three versions: MA0035.1, MA0035.2 and MA0035.3. By default, only the latest version is shown, but it is possible to list all versions of a matrix with the same stable ID.
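The [stable ID].[version] scheme can be illustrated with a short Python sketch. The function names are illustrative only and are not part of any JASPAR tool; the sketch assumes IDs made of uppercase letters and digits followed by a dot and an integer version.

```python
import re

# Assumed ID shape: letters + digits, a dot, then an integer version (e.g. MA0035.3).
ID_PATTERN = re.compile(r"^([A-Z]+\d+)\.(\d+)$")


def split_matrix_id(matrix_id):
    """Split '[stable ID].[version]' into (stable_id, version)."""
    match = ID_PATTERN.match(matrix_id)
    if match is None:
        raise ValueError(f"not a versioned matrix ID: {matrix_id!r}")
    return match.group(1), int(match.group(2))


def latest_versions(matrix_ids):
    """Keep only the highest version seen for each stable ID."""
    latest = {}
    for matrix_id in matrix_ids:
        stable_id, version = split_matrix_id(matrix_id)
        if version > latest.get(stable_id, 0):
            latest[stable_id] = version
    return sorted(f"{stable}.{version}" for stable, version in latest.items())


print(split_matrix_id("MA0035.2"))
print(latest_versions(["MA0035.1", "MA0035.3", "MA0035.2"]))
```

Listing all versions of one factor then amounts to filtering on the stable ID returned by split_matrix_id.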

 

 

1.2 The JASPAR CORE Collection

The JASPAR CORE collection contains a curated, non-redundant set of TF binding profiles. All profiles are derived from published collections of experimentally defined transcription factor binding sites for multi-cellular eukaryotes. The TF binding profiles were historically determined from SELEX experiments or from collections of experimentally determined binding sites in actual regulatory regions. More recent profiles are derived from high-throughput techniques such as ChIP-sequencing, Protein Binding Microarray, or High-Throughput SELEX. One of the central goals of JASPAR CORE is to provide a single, “best” model for each transcription factor. This means that the database is non-redundant in the sense that there are not many models for the same factor (with a few exceptions motivated by the recognition of significantly different motifs).

 

The main differences from similar resources (TRANSFAC, etc.) are open data access, non-redundancy and quality: JASPAR CORE is a smaller set that is non-redundant and curated.

JASPAR CORE is what most scientists mean when referring to JASPAR in manuscripts.

 

 

For convenience, JASPAR CORE is divided into larger groups of species. This distinction is mainly used in the web interface and, optionally, in the download section. Currently these larger taxonomic groups are: vertebrates, insects, nematodes, fungi, plants and urochordates.

 

 

 

What annotation data does each entry hold?

ID

A unique identifier for each model. CORE matrices always have MAnnnn IDs, followed by a version number (see 1.1 Version control).

Name

The name of the transcription factor. As far as possible, the name is based on the standardized Entrez gene symbols. In the case the model describes a transcription factor hetero-dimer, two names are concatenated, such as RXR-VDR. In a few cases, different splice forms of the same gene have different binding specificity: in this case the splice form information is added to the name, based on the relevant literature.

Class

Structural class of the transcription factor, based on the TFClass system

Family

Structural sub-class of the transcription factor, based on the TFClass system

Species

The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.

Tax_group

Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate

Acc

A representative protein accession number in Genbank for the transcription factor. Human takes precedence if several exist.

Type

Methodology used for matrix construction  (see below)

Pubmed ID

A link to the relevant publication reporting the sites used in the model building

Pazar_tf_id

A link to the PAZAR database

Comment

For some matrices, a curator comment is added

When should it be used?

This is the main JASPAR collection and should be used when curated, non-redundant binding profile models for specific factors derived from experimental data are required.

 

 

1.3 Other JASPAR Collections

The other JASPAR collections are collections of matrices that do not fit under the JASPAR CORE scope. Examples include splice forms, computationally derived patterns with no linked transcription factors,  meta-models etc.

 

1.3.1.    JASPAR FAM

The JASPAR FAM database consists of 11 models describing shared binding properties of structural classes of transcription factors. These types of models can be called "familial profiles", "consensus matrices" or metamodels. The models have two prime benefits: 1) Since many factors have similar target sequences, scanning often yields multiple predictions at the same location that correspond to the same site. This type of model reduces the complexity of the results. 2) The models can be used to classify newly derived profiles (or to predict which structural class the cognate transcription factor belongs to). The construction of the models is based on the JASPAR CORE collection and is described in detail in

Sandelin A, Wasserman WW. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol. 2004 Apr 23;338(2):207-15. A recent, comprehensive study of familial binding profiles and associated methods is available in the PLoS paper by Mahony et al.

What data does each entry hold? 

ID

A unique identifier for each model. FAM matrices always have  MFnnnn IDs

Name

The name of model. In this database, models were built by first partitioning JASPAR CORE matrices into  structural classes – therefore, the names are essentially structure class names

PubMed ID

The source article (always J Mol Biol. 2004 Apr 23;338(2):207-15)

Included models

The JASPAR CORE matrices used to construct the model

Type

Always “Metamodel”

When should it be used?

When searching large genomic sequences with no prior knowledge. For classification of new user-supplied profiles.

 

1.3.2.     JASPAR PHYLOFACTS

The JASPAR PHYLOFACTS database consists of 174 profiles that were extracted from phylogenetically conserved gene upstream elements.

For a detailed description, see Xie et al., Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals., Nature 434, 338-345 (2005) and supplementary material

In short, the authors used the following strategy. Promoters (defined as the 4-kb region around the TSS) of human genes from the RefSeq database were aligned against the genomes of mouse, rat and dog. Every consensus sequence of length between 6 and 26, defined over an alphabet of 4 unique (A, C, G, T) and 7 degenerate (R, Y, K, M, S, W, N) nucleotides, was scanned over the alignments. A motif is regarded as conserved when it appears in the alignment both for human and for the other three mammalian species. The conservation rate p is defined as the number of times a motif is conserved divided by the number of times it occurs in the human sequence alone. This conservation rate is compared to the expected conservation rate p0, estimated from random motifs, which gives the motif conservation score (MCS). Only motifs with an MCS > 6 were retained, resulting in a list of 174 highly conserved motifs (see supplementary Table S2 of Xie et al.). The count matrices for these 174 motifs were extracted from the downloaded alignments. They were further annotated according to their resemblance to TRANSFAC and JASPAR CORE motifs. For TRANSFAC, the annotation of Xie et al. was used. For the comparison to the JASPAR CORE matrices, the Pearson Correlation Coefficient (PCC) was used to define matrix similarity. All PHYLOFACTS matrices were scanned against the JASPAR CORE matrices, and matrices were regarded as similar when PCC > 0.8. When multiple hits were found, only the one with the highest PCC was retained.
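To make the degenerate alphabet concrete, the following Python sketch expands a consensus written with the 11 symbols above into a regular expression and lists its occurrences in a sequence. It is an illustration of the notation only, not the pipeline of Xie et al.: it scans a single strand of a single sequence, the function names are made up for this example, and the test sequence is invented.

```python
import re

# Standard IUPAC meanings of the 4 unique and 7 degenerate symbols.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate a degenerate consensus into a compiled regular expression.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    body = "".join(IUPAC[symbol] for symbol in consensus.upper())
    return re.compile(f"(?=({body}))")


def find_matches(consensus, sequence):
    """Return the 0-based start positions of the consensus in the sequence."""
    pattern = consensus_to_regex(consensus)
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Invented example: Y matches both C and T, so both sites are found.
print(find_matches("CAYGTG", "ttCACGTGaaCATGTGcc"))
```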

 What data does each entry hold? 

ID

A unique identifier for each model. PHYLOFACTS matrices always have PFnnnn IDs

Name

The name of model. In this database, models are based on over-represented words which are unique. The name is simply the consensus sequence.

Jaspar

The JASPAR CORE motif that has the best similarity score when compared to this model. Only hits with a similarity score over 0.8 are considered.

Transfac

The TRANSFAC (public version) motif that has the best similarity score when compared to this model. Only hits with a similarity score over 0.8 are considered.

Sysgroup

Group of species. Always “mammals”

Type

Always “phylogenetic”

PubMed ID

The source article (always  Nature 434, 338-345 (2005))

When should it be used?

The JASPAR PHYLOFACTS matrices are a mix of motifs for known and undefined transcription factors. They are useful when one expects that other factors might determine promoter characteristics, such as structural aspects and tissue specificity. They are highly complementary to the JASPAR CORE matrices, and so are best used in combination with that matrix set.

 

1.3.3.    JASPAR POLII

The deluge of novel data presented recently pertaining to transcription start sites (reviewed in (13,14)) motivates computational studies of core promoters. The JASPAR_POLII sub-database holds 13 known DNA patterns linked to RNA polymerase II core promoters, such as the Inr and BRE elements, each based on experimental evidence: each model must be constructed using 5 or more experimentally verified sites. An important difference from the transcription factor profiles in JASPAR CORE is that patterns here do not necessarily have a specified protein interactor (see (15) for a review on core promoter patterns). When possible, profiles were extended by two nucleotides beyond the core motif. We consistently report positions relative to the TSS as the positions of the 5’ and 3’ edges of the matrix.

What data does each entry hold?

ID

a unique identifier for each model. POLII matrices always have  POLnnn IDs

Name

The reported name of the pattern (not necessarily the binding protein, if this is known)

Species

The species source for the sequences, in Latin. “-“ generally signifies that several species were used in the model construction

PubMed ID

A link to the relevant publication reporting the sites used in the model building

Start relative to TSS

Reported bias (if any) on position relative to the dominant transcription start site in the promoter. This is counted from the 5' end of the pattern (the left side). As we have added some flanking nucleotides, this sometimes is not the exact numbers shown in the source publications.

End relative to TSS

See above. Distance is counted from the 3' end of the matrix (the right side).

When should it be used?

When analyzing properties of core promoters.

 

1.3.4. JASPAR CNE

Highly conserved non-coding elements are a distinctive feature of metazoan genomes. Many of them can be shown to act as long-range enhancers that drive expression of genes that are themselves regulators of core aspects of metazoan development and differentiation. Since they act as regulatory inputs, attempts at deciphering the regulatory content of these elements have started. JASPAR CNE is a collection of 233 matrix profiles  derived by Xie et al based on clustering of overrepresented motifs from human conserved non-coding elements. While the biochemical and biological role of most of these patterns is still unknown, Xie et al. have shown that the most abundant ones correspond to known DNA-binding proteins, most notably insulator-binding protein CTCF. These matrix profiles will be useful for further characterization of regulatory inputs in long-range developmental gene regulation in vertebrates.

What data does each entry hold? 

ID

A unique identifier for each model. CNE matrices always have CNnnnn IDs

Name

The name of model.

Consensus sequence

The consensus sequence of the motif. This is important because it is the basis for clustering over-represented sites in this study

PubMed ID

The source article (always Xie et al)

When should it be used?

When analyzing properties of potential enhancers.

 

1.3.4. JASPAR SPLICE

This small collection contains matrix profiles of human canonical and non-canonical splice sites, as matching donor:acceptor pairs. It currently contains only 6 highly reliable profiles derived from the human genome by Chong et al. In the future, we shall include additional eukaryotic species, as well as new models for exonic splicing enhancers (ESE) and inhibitors (ESI).

 

What data does each entry hold? 

ID

a unique identifier for each model. SPLICE matrices always have  SPnnnn IDs

Name

The name of model.

 

 

PubMed ID

The source article (always Chong et al.)

When should it be used?

When analyzing splice sites and alternative splicing

 

1.3.5 JASPAR PBM

All the PBM collections are built by using  new in-vitro techniques, based on k-mer microarrays. PBM matrix models have their own database which is specialized for the data: UniPROBE.

The PBM collection is the set derived by Badis et al. from binding preferences of 104 mouse transcription factors. One profile (IRC900814) was excluded because the transcription factor could not be identified.

 

What data does each entry hold? 

ID

A unique identifier for each model. PBM matrices always have PBnnnn IDs

Name

The name of model.

Class

Structural class of the transcription factor, based on the TFClass system

Family

Structural sub-class of the transcription factor, based on the TFClass system

Species

The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.

Tax_group

Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate

PubMed ID

A link to the relevant publication reporting the sites used in the model building

Type

Methodology used for matrix construction 

Comment

For some matrices, a curator comment is added

When should it be used?

Where it is important that each matrix was derived using the same protocol

 

1.3.6 JASPAR PBM HOMEO

All the PBM collections are built by using  new in-vitro techniques, based on k-mer microarrays. PBM matrix models have their own database which is specialized for the data: UniPROBE.

The PBM HOMEO collection is the set derived by Berger et al., comprising 176 profiles from mouse homeodomains.

What data does each entry hold? 

ID

A unique identifier for each model. PBM HOMEO matrices always have PHnnnn IDs

Name

The name of model.

Class

Structural class of the transcription factor, based on the TFClass system

Family

Structural sub-class of the transcription factor, based on the TFClass system

Species

The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.

Tax_group

Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate

PubMed ID

A link to the relevant publication reporting the sites used in the model building

Type

Methodology used for matrix construction 

Comment

For some matrices, a curator comment is added

When should it be used?

Where it is important that each matrix was derived using the same protocol, focused on homeobox factors

 

 

 

1.3.7 JASPAR PBM HLH

All the PBM collections are built by using  new in-vitro techniques, based on k-mer microarrays. PBM matrix models have their own database which is specialized for the data: UniPROBE.

The PBM HLH collection is the set derived by Grove et al. It holds 19 C. elegans bHLH transcription factor models.

What data does each entry hold? 

ID

A unique identifier for each model. PBM HLH matrices always have PLnnnn IDs

Name

The name of model.

Class

Structural class of the transcription factor, based on the TFClass system

Family

Structural sub-class of the transcription factor, based on the TFClass system

Species

The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.

Tax_group

Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate

PubMed ID

A link to the relevant publication reporting the sites used in the model building

Type

Methodology used for matrix construction 

Comment

For some matrices, a curator comment is added

 When should it be used?

Where it is important that each matrix was derived using the same protocol, focused on bHLH factors

 

 

 

 

2.  JASPAR WEB API

The JASPAR database can now be reached remotely through a new Web Service interface. Current functionality includes retrieval of profiles by name, by identifier and by searching profile annotations. Profiles can be retrieved as position frequency matrices, position weight matrices or information content matrices. The purpose of providing an external application programming interface (API) is to simplify the use of JASPAR in distributed applications and in scientific workflows created in workflow editors like Triana, BPEL, or Taverna. Other benefits include platform- and language-independent access, as well as constantly up-to-date access to the database over time. The API is implemented as a WS-I compliant Web service, identical to the technology used for the services made available through the EMBRACE Network of Excellence, and the Web service technology chosen by the European Bioinformatics Institute (EBI). The WSDL describing this service can be found here. Further information about the Web service is available in the WSDL file, including example clients in Java and Python.

 

3. JASPAR SOFTWARE TOOLS

JASPAR is supported by a growing number of open-source software tools implemented in various programming languages including Perl, Python (Biopython), R (Bioconductor) and Ruby.


3.1 JASPAR Perl API

JASPAR has a fully fledged API in the Perl programming language. Specifically, the TFBS::DB::JASPAR6.pm module interacts with the MySQL version of JASPAR. The API can be used for a large number of tasks, including searching sequences and alignments. See the TFBS page and the TFBS::DB::JASPAR6 section in the manual. JASPAR5.pm through JASPAR7.pm are very similar to JASPAR4 in terms of methods, but are built on another data model ‘under the hood’.


3.2 BioPython module

BioPython now includes a JASPAR-specific sub-class of Bio.motifs for fetching profiles from the JASPAR DB, as well as for reading and writing motifs in the various flat-file matrix formats supported by JASPAR.


3.3 Bioconductor Package

A Bioconductor package providing functionality to access the JASPAR database through R.
http://www.bioconductor.org/packages/devel/data/experiment/html/JASPAR2014.html


3.4 Ruby Gem

A Ruby gem providing basic functionality for parsing, searching, and comparing JASPAR motifs.
https://rubygems.org/gems/bio-jaspar


3.5 R TFBSTools

An R alternative to the TFBS Perl module.
http://www.bioconductor.org/packages/2.13/bioc/html/TFBSTools.html


3.6 Transcription Factor Flexible Models (TFFM)

New in JASPAR 2016, 130 profiles are provided as TFFM models.
http://cisreg.cmmt.ubc.ca/TFFM/doc/


3.7 JASPAR at CRAN

A set of R modules to handle the JASPAR template (an internal data format) for storing TFBS profiles and related meta information, mainly for internal sharing purposes.
http://cran.r-project.org/web/packages/JASPAR/index.html

 

4. DOWNLOADING DATABASES

JASPAR is downloadable in two different formats from the DOWNLOAD link on the start page:

i)  flat files resulting from the TFBS::DB::MatrixDir function in the Perl API, which are easily parsable

ii) flat files corresponding to the MySQL tables used internally in the database. The create table statements are here

Downloading sites and alignments

In the DOWNLOAD directory, most matrix collections have a SITE subdirectory, which lists, for each model, all sites used for the model construction as a FASTA file. The alignments are implicit: the parts of the sequences that were used are in capitals. Note that in the majority of cases this is an interpretation: we use pattern finders to find the most likely alignment, but this might not always be the correct one. This is the principal reason we make these collections available: users can build their own models from the raw files.
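As a rough illustration of how the capitalised parts of such a file can be turned into a count matrix, consider the Python sketch below. The FASTA records are invented for the example, and the sketch assumes that the capitalised part of every record has the same length; real SITE files may need extra handling.

```python
# Invented records in the style described above: lower case = flanks,
# upper case = the aligned part used for the model.
EXAMPLE_FASTA = """\
>site1
acgtCACGTGttga
>site2
ggCATGTGcc
>site3
tCACGTGa
"""


def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_site(sequence):
    """Keep only the capitalised (aligned) letters of a site sequence."""
    return "".join(base for base in sequence if base.isupper())


def count_matrix(sites):
    """Count A, C, G and T per column; all sites must have equal length."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("aligned sites differ in length")
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site):
            if base in counts:  # ignore any non-ACGT symbol
                counts[base][position] += 1
    return counts


sites = [aligned_site(seq) for _, seq in read_fasta(EXAMPLE_FASTA)]
for base, row in count_matrix(sites).items():
    print(base, row)
```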

 

 

5. JASPAR WEB HANDBOOK 

a.    START PAGE

The start page has four major “tabs” that determine the way in which you will interact with the respective JASPAR database. First, use the “SELECT A JASPAR SUB-DATABASE” tab (this will also give a brief summary of the respective database). After this, you can use one of the following tabs:

BROWSE tab: The whole selected database will be shown (see below for information about the browse page), sorted by the selected attribute (the default attribute is ID). As many users are interested in the JASPAR CORE collection, you can click the “Browse the JASPAR CORE collection right away” link just under the JASPAR image.

SEARCH BY tab: To select subsets of profiles using user-set criteria, use the search by fields.

Multiple inputs are acceptable, if submitted with commas in between ',' (this will in effect be interpreted as an OR statement)

The criteria will be interpreted from top to bottom by using the boolean statement at each row:

an AND statement will perform an intersect between two query results

an OR statement will perform a union of two query results

a NOT statement will filter out results from the first query
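The top-to-bottom evaluation described above behaves like a chain of set operations. The following Python sketch shows that reading; the matrix IDs and the combine function are hypothetical and do not reflect the web interface's actual code.

```python
def combine(rows):
    """Combine query results row by row, from top to bottom.

    rows is a list of (operator, ids) pairs, where ids is the set of
    matrix IDs matched by that row. The operator of the first row is
    ignored, since there is nothing to combine it with yet.
    """
    result = set(rows[0][1])
    for operator, ids in rows[1:]:
        if operator == "AND":
            result &= ids      # intersection of the two query results
        elif operator == "OR":
            result |= ids      # union of the two query results
        elif operator == "NOT":
            result -= ids      # filter the row's hits out of the result so far
        else:
            raise ValueError(f"unknown operator: {operator!r}")
    return result


# Hypothetical hits for four search rows.
rows = [
    (None, {"MA0001", "MA0002", "MA0003"}),
    ("AND", {"MA0002", "MA0003", "MA0004"}),
    ("NOT", {"MA0003"}),
    ("OR", {"MA0009"}),
]
print(sorted(combine(rows)))
```

Because rows are applied in order, moving a NOT row above an OR row can change the outcome.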

ALIGN to custom matrix tab: In some cases, it is beneficial to assess the similarity of database profiles to input data (much like using BLAST to compare sequence data against Genbank). The input profile can consist of actual counts or be normalized (each column sum = 1). Log-odds matrices should be avoided. For an example of the input format, press the “fill in an example matrix” button. The A [ ] etc. characters are optional, so:

A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]


is equivalent to

13 13 3 1 54 1 1 1 0 3 2 5
13 39 5 53 0 1 50 1 0 37 0 17
17 2 37 0 0 52 3 0 53 8 37 12
11 0 9 0 0 0 0 52 1 6 15 20
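A minimal Python sketch of a parser that accepts both layouts, and of the column normalization mentioned above, could look as follows. It is an illustration under the assumption that rows are always given in the order A, C, G, T and contain non-negative numbers; it is not the parser used by the web site.

```python
import re

LABELLED = """\
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
"""

PLAIN = """\
13 13 3 1 54 1 1 1 0 3 2 5
13 39 5 53 0 1 50 1 0 37 0 17
17 2 37 0 0 52 3 0 53 8 37 12
11 0 9 0 0 0 0 52 1 6 15 20
"""


def parse_matrix(text):
    """Parse four rows of numbers (order A, C, G, T), ignoring labels and brackets."""
    rows = []
    for line in text.splitlines():
        numbers = re.findall(r"\d+(?:\.\d+)?", line)
        if numbers:
            rows.append([float(number) for number in numbers])
    if len(rows) != 4 or len({len(row) for row in rows}) != 1:
        raise ValueError("expected four rows of equal length")
    return dict(zip("ACGT", rows))


def normalise(matrix):
    """Scale every column so that its four values sum to 1."""
    width = len(matrix["A"])
    totals = [sum(matrix[base][i] for base in "ACGT") for i in range(width)]
    return {base: [matrix[base][i] / totals[i] for i in range(width)]
            for base in "ACGT"}


assert parse_matrix(LABELLED) == parse_matrix(PLAIN)
print(normalise(parse_matrix(PLAIN))["A"][:3])
```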

 

All profiles in the selected database will be compared to the input profile, using a modified Needleman-Wunsch algorithm described in Sandelin A, Hoglund A, Lenhard B, Wasserman WW. Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics. 2003 Jul;3(3):125-34. Results are sorted by raw comparison score (for reference, the maximum score is 2 times the width of the smaller matrix in the compared pair). Both the score and the fraction of the potential maximum score are reported.
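The relation between the raw score and the reported fraction can be written down directly. In this Python sketch the raw score of 18 is an invented value and the function name is illustrative.

```python
def fraction_of_max(raw_score, width_a, width_b):
    """Return raw_score as a fraction of the maximum possible score.

    The maximum is 2 times the width of the smaller of the two matrices.
    """
    max_score = 2 * min(width_a, width_b)
    return raw_score / max_score


# A 12-column input compared with a 10-column profile: maximum score 20.
print(fraction_of_max(18, 12, 10))
```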

 

b.    BROWSE PAGE

This page presents a list of selected profiles. At the top, it is possible to select a subset of the matrices, much like in the start page described above.

Note that the columns of this page will differ between databases, but ID and Name attributes will always be shown, as well as a sequence logo for each pattern. For detailed information regarding any profile model, press the view link – this will give a pop-up window with detailed matrix information (see below).

At the right-hand side, a number of functional analyses can be run on selected profiles (see the selection field in the left-most column). Currently, it is possible to i) cluster selected matrices into familial binding profiles using the STAMP tool, ii) create permuted matrices by column shuffling of the selected matrices, iii) create random matrices using a Bayesian sampling procedure, and iv) perform basic sequence analysis (scanning an input sequence with matrices).

These features are described in detail below, in section d) : extended functionality.

c.    DETAILED MATRIX INFORMATION

These pop-up pages are results of clicking on the logos of chosen models in the browse page, and show detailed information about the model: both annotation data (which is different in different databases – see respective database entry above), and a sequence logo, a count matrix and hits/bp statistics:

LOGO:     

Sequence logos are graphical representations of the matrix model, based on information content. The information content of a matrix column ranges from 0 (no base preference) to 2 bits (only one base used). A sequence logo is essentially a bar plot showing the total information content at each position, where the bar is replaced by stacked letters (A, C, G, T), which are sized and sorted according to their frequency of occurrence. See Schneider et al. for a more comprehensive description.

The “Make SVG” button gives an SVG version of the logo suitable for publication images (SVG is a vector format that does not have the pixel edges of the .png files used in browsers; it can be read by many drawing programs and by most web browsers with proper plugins).
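The 0 to 2 bit range of the information content described in the LOGO section follows from the Shannon entropy of a column over four bases. The Python sketch below computes it from raw counts; it deliberately leaves out the small-sample correction discussed by Schneider et al., so it is a simplified illustration rather than the exact logo calculation.

```python
import math


def column_information(counts):
    """Information content (bits) of one column given its A, C, G, T counts.

    Computed as 2 minus the Shannon entropy; no small-sample correction.
    """
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


def letter_heights(counts):
    """Height of each letter in the stack: frequency times column information."""
    total = sum(counts)
    info = column_information(counts)
    return [info * count / total for count in counts]


print(column_information([54, 0, 0, 0]))    # a single base: maximum information
print(column_information([10, 10, 10, 10]))  # no preference: no information
print(letter_heights([13, 13, 17, 11]))
```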

COUNT MATRIX: 

The underlying model showing the DNA pattern. In most databases, the cell numbers indicate the number of sequences having base x in column y. These matrices can be used for a number of different analyses, including site searching, if suitably converted. See Wasserman and Sandelin for a review.

The reverse complement button makes a reverse-complement version of the matrix (as DNA is double-stranded, the two models are functionally equivalent). If the matrix is reverse-complemented, the logo will change accordingly.
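Reverse-complementing a count matrix means swapping the A and T rows and the C and G rows, and reversing the column order. A small Python sketch, using a dictionary of rows as an assumed in-memory representation:

```python
def reverse_complement(matrix):
    """Return the reverse complement of a count matrix.

    matrix maps each base (A, C, G, T) to its row of per-column counts.
    """
    return {
        "A": matrix["T"][::-1],
        "C": matrix["G"][::-1],
        "G": matrix["C"][::-1],
        "T": matrix["A"][::-1],
    }


matrix = {
    "A": [13, 13, 3, 1],
    "C": [13, 39, 5, 53],
    "G": [17, 2, 37, 0],
    "T": [11, 0, 9, 0],
}
flipped = reverse_complement(matrix)
print(flipped["A"])
# Applying the operation twice gives back the original matrix.
assert reverse_complement(flipped) == matrix
```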

VERSION INFORMATION

For some transcription factors, there are multiple models; usually this is due to new data becoming available. Clicking on this link gives a listing of all the models for the factor in question.

EXPECTED HITS/BP

In order to visualize the binding properties of each JASPAR matrix we calculate the average number of hits per 1000 base pairs on three distinctly different sequence sets. We do this by converting the count matrix to a log-odds matrix using a uniform background model over the four bases. For a series of threshold values ([1, 0.95, ... , 0.65, 0.60]) of the scoring range of the log-odds matrix we count the number of hits equal to or greater than the current threshold. We count the number of hits treating each sequence set as one string and then convert this number to a mean value per 1000 base pairs on both strands, that is, we search both the leading strand and the reverse complement. All means are for practical purposes rounded to one decimal.

We use three distinct sequence sets: known promoters, CpG islands and random DNA. The known promoters consist of all plant, arthropod and vertebrate promoters in the -1000 to +100 region from the EPD database [ref 1]. This sequence set totals 4735 promoters concatenated into one string. The CpG sequence set consists of all regions from the UCSC genome browser (hg18) with an epigenetic score above 0.5 (see Bock et al.). This totals 8,559,418 nucleotides. Finally, the random DNA sequences are randomly picked 1000 base pair windows from hg18 across all chromosomes, totalling 8,000,000 nucleotides. The randomly picked DNA is not repeat-masked or filtered in any way.
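The hits/bp calculation can be sketched in Python as follows. Several details are assumptions made for the example and are not taken from the JASPAR implementation: a pseudocount of 1 per base, a threshold interpreted as a fraction of the range between the lowest and highest possible log-odds score, and a normalisation by the length of the input sequence. The toy matrix and sequence are invented.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds(counts, pseudocount=1.0):
    """Convert counts to log2-odds columns against a uniform background (0.25)."""
    width = len(counts["A"])
    columns = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        columns.append({
            base: math.log2(((counts[base][i] + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return columns


def count_hits(counts, sequence, threshold):
    """Count windows on both strands scoring at or above the relative threshold."""
    pwm = log_odds(counts)
    width = len(pwm)
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    # Small tolerance guards against floating-point rounding at threshold 1.
    cutoff = lowest + threshold * (highest - lowest) - 1e-9
    forward = sequence.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    hits = 0
    for strand in (forward, reverse):
        for start in range(len(strand) - width + 1):
            window = strand[start:start + width]
            if set(window) <= set("ACGT"):
                score = sum(column[base] for column, base in zip(pwm, window))
                if score >= cutoff:
                    hits += 1
    return hits


def hits_per_kb(counts, sequence, threshold):
    """Hits on both strands per 1000 bp of input, rounded to one decimal."""
    return round(1000 * count_hits(counts, sequence, threshold) / len(sequence), 1)


toy = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
for threshold in (1.0, 0.8, 0.6):
    print(threshold, hits_per_kb(toy, "ACGAAAAAAA", threshold))
```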

 

d. EXTENDED FUNCTIONALITY

i.     BASIC SEQUENCE ANALYSIS 

Using a subset of profiles, a submitted sequence can be analyzed. Sensitivity and specificity will be affected by the relative score threshold, which is 80% by default (see Wasserman and Sandelin for a review on scoring of matrices against sequences). This is the most basic form of sequence analysis: dedicated systems such as ConSite are preferable for anything more than a casual analysis.

 

 ii.     DYNAMIC CLUSTERING OF MATRICES

The CLUSTER button provides the user with a means of investigating the relationship between the various matrices. This functionality is provided by the STAMP tool, available as a web service at http://www.benoslab.pitt.edu/stamp/

Hierarchical clustering is performed on a selected set of matrices using the UPGMA algorithm with a Pearson Correlation Coefficient distance metric. Then the optimal number of clusters is selected using a log variant of the Calinski and Harabasz statistic (See this link for details). Finally, the clusters are partitioned and a familial binding profile is created for each cluster using an iterative-refinement multiple alignment method. Further details can be found in the STAMP manuscript.

 

iii.     DYNAMIC RANDOM MATRIX GENERATION

1.    PERMUTATION

This option simply shuffles the columns in matrices. This can either be done by shuffling columns within each selected matrix, or by shuffling columns among all selected matrices.
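Both shuffling modes can be sketched in a few lines of Python. The helper names are illustrative, and the "among all matrices" variant assumes that each output matrix keeps the width of the corresponding input matrix, which the description above does not specify.

```python
import random


def to_columns(matrix):
    """Turn a base -> row mapping into a list of (A, C, G, T) column tuples."""
    return list(zip(*(matrix[base] for base in "ACGT")))


def from_columns(columns):
    """Inverse of to_columns."""
    return {base: [column[i] for column in columns]
            for i, base in enumerate("ACGT")}


def shuffle_within(matrix, rng):
    """Shuffle the columns of a single matrix."""
    columns = to_columns(matrix)
    rng.shuffle(columns)
    return from_columns(columns)


def shuffle_among(matrices, rng):
    """Pool the columns of all matrices, shuffle, and deal them back out."""
    pool = [column for matrix in matrices for column in to_columns(matrix)]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for matrix in matrices:
        width = len(matrix["A"])
        shuffled.append(from_columns(pool[start:start + width]))
        start += width
    return shuffled


rng = random.Random(42)  # fixed seed only to make the example repeatable
m1 = {"A": [13, 13, 3], "C": [13, 39, 5], "G": [17, 2, 37], "T": [11, 0, 9]}
m2 = {"A": [1, 54], "C": [53, 0], "G": [0, 0], "T": [0, 0]}
print(shuffle_within(m1, rng))
print(shuffle_among([m1, m2], rng))
```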

2.    SAMPLING

This feature of the database enables the users to generate random Position Frequency Matrices (PFMs) from selected profiles.


We assume that each column in the profile is independent and described by a mixture of Dirichlet multinomials, in which the letters are drawn from a multinomial and the multinomial parameters are drawn from a mixture of Dirichlets. Within this model, each column has its own set of multinomial parameters, but the higher-level parameters (those of the mixture prior) are assumed to be common to all JASPAR matrices. We can therefore use a maximum likelihood approach to learn these from the observed column counts of all JASPAR matrices. The maximum likelihood approach automatically ensures that each matrix receives a weight relative to the number of counts it contains.


Drawing samples from the prior distribution will generate PWMs with the same statistical properties as the Jaspar matrices as a whole. PWMs with statistical properties like those of the selected profiles can be obtained by drawing from a posterior distribution which is proportional to the prior times a multinomial likelihood term with counts taken from one of the columns of the selected profiles.


Each 4-dimensional column is sampled by the following three-step procedure: 1. draw the mixture component according to the distribution of mixing proportions, 2. draw an input column randomly from the concatenated selected profiles and 3. draw the probability vector over nucleotides from a 4-dimensional Dirichlet distribution. The parameter vector alpha of the Dirichlet is equal to the sum of the counts (of the drawn input column) and the parameters of the Dirichlet prior (of the drawn component).


Draws from a Dirichlet can be obtained in the following way from Gamma distributed samples:

 

(X1,X2,X3,X4) = (Y1/V,Y2/V,Y3/V,Y4/V) ~ Dir(α1,α2,α3,α4)


where the Yi ~ Gamma(shape = αi, scale = 1) are drawn independently and V = sum(Yi) ~ Gamma(shape = sum(αi), scale = 1).
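The three-step procedure and the Gamma construction can be sketched with the Python standard library. The mixture components below (mixing proportions and prior parameters) are invented placeholders; the actual values are learned from the JASPAR matrices as described above and are not given in this document.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    v = sum(ys)
    return [y / v for y in ys]


def sample_column(input_columns, components, rng):
    """Sample one probability vector over (A, C, G, T).

    input_columns: count columns of the concatenated selected profiles.
    components: list of (mixing_proportion, prior_alpha) pairs.
    """
    # Step 1: draw a mixture component according to the mixing proportions.
    weights = [proportion for proportion, _ in components]
    _, prior_alpha = rng.choices(components, weights=weights, k=1)[0]
    # Step 2: draw an input column at random.
    counts = rng.choice(input_columns)
    # Step 3: draw from a Dirichlet with alpha = counts + prior parameters.
    alpha = [c + a for c, a in zip(counts, prior_alpha)]
    return sample_dirichlet(alpha, rng)


rng = random.Random(0)  # fixed seed only to make the example repeatable
columns = [(13, 13, 17, 11), (13, 39, 2, 0), (3, 5, 37, 9)]
# Hypothetical two-component prior, for illustration only.
components = [(0.7, (1.0, 1.0, 1.0, 1.0)), (0.3, (0.1, 0.1, 0.1, 5.0))]
random_matrix = [sample_column(columns, components, rng) for _ in range(3)]
for column in random_matrix:
    print([round(p, 3) for p in column])
```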

 

3. OUTPUT FORMATS


For both permuted and randomly generated matrices, you have the choice between three different output formats:

 

Raw - Each matrix is separated by a fasta like header starting with the > symbol and then a matrix ID. The count for each base (ACGT) is specified on its own space separated line where each element corresponds to one column. The order of the lines for the bases is A,C,G and finally T.

 

13 13 3 1 54 1 1 1 0 3 2 5
13 39 5 53 0 1 50 1 0 37 0 17
17 2 37 0 0 52 3 0 53 8 37 12
11 0 9 0 0 0 0 52 1 6 15 20

 

JASPAR - This is similar to the raw format, having an identical header. The line for each base, however, starts with a label for the nucleotide (A, C, G or T), and the columns then follow enclosed in brackets: [].

 

A [13 13 3 1 54 1 1 1 0 3 2 5 ]

C [13 39 5 53 0 1 50 1 0 37 0 17 ]

G [17 2 37 0 0 52 3 0 53 8 37 12 ]

T [11 0 9 0 0 0 0 52 1 6 15 20 ]
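
A minimal parser for this format might look as follows. It is a sketch under assumptions: the function name is hypothetical, the header is assumed to start with ">" followed by the matrix ID, and counts are assumed to be integers as in the example.

```python
import re

# One base label, then the counts enclosed in square brackets.
BASE_LINE = re.compile(r"^([ACGT])\s*\[(.*)\]$")

def parse_jaspar(text):
    """Parse a single JASPAR-format matrix into (matrix_id, counts)."""
    matrix_id = None
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            matrix_id = fields[0] if fields else None
            continue
        match = BASE_LINE.match(line)
        if match:
            base, values = match.groups()
            counts[base] = [int(v) for v in values.split()]
    return matrix_id, counts

jaspar_text = """>MA0048
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
"""
parsed_id, parsed_counts = parse_jaspar(jaspar_text)
print(parsed_id, parsed_counts["A"])
```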

 

TRANSFAC - This is a TRANSFAC-like format with a header that starts with "DE", followed by the matrix ID, the matrix name and the matrix class. The data itself is transposed compared to the other formats, meaning that each line corresponds to a column in the matrix. Each column line starts with a number denoting the column index (counting from 0), followed by the tab-separated counts for each base in that column in the order A, C, G and T. The count lines are followed by a final line containing the string "XX".

 

DE MA0048    NHLH1    bHLH

00    13    13    17    11   

01    13    39    2    0   

02    3    5    37    9   

03    1    53    0    0   

04    54    0    0    0   

05    1    1    52    0   

06    1    50    3    0   

07    1    1    0    52   

08    0    0    53    1   

09    3    37    8    6   

10    2    0    37    15   

11    5    17    12    20   

XX
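
The transposition can be sketched as below. The function name is hypothetical, and two details are assumptions read off the example rather than stated rules: the column index is zero-padded to two digits, and the header fields after "DE" are tab-separated.

```python
def format_transfac(matrix_id, name, tf_class, counts):
    """Return a TRANSFAC-like string with one line per matrix column."""
    lines = ["DE " + "\t".join([matrix_id, name, tf_class])]
    # zip(...) transposes the four per-base rows into per-column tuples (A, C, G, T).
    columns = zip(counts["A"], counts["C"], counts["G"], counts["T"])
    for index, column in enumerate(columns):       # column index counts from 0
        lines.append("\t".join([f"{index:02d}"] + [str(c) for c in column]))
    lines.append("XX")                             # terminating line
    return "\n".join(lines)

transfac_counts = {
    "A": [13, 13, 3, 1, 54, 1, 1, 1, 0, 3, 2, 5],
    "C": [13, 39, 5, 53, 0, 1, 50, 1, 0, 37, 0, 17],
    "G": [17, 2, 37, 0, 0, 52, 3, 0, 53, 8, 37, 12],
    "T": [11, 0, 9, 0, 0, 0, 0, 52, 1, 6, 15, 20],
}
transfac_text = format_transfac("MA0048", "NHLH1", "bHLH", transfac_counts)
print(transfac_text)
```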

 

 

6. FAQ

a.    How do I cite JASPAR?


It depends on what you have used it for. If you simply want to acknowledge that you used the latest version, use

 

Mathelier A, Fornes O, Arenillas D, Chen C-y, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R, Zhang A, Parcy F, Lenhard B, Sandelin A, Wasserman W
JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles.
Under review

 

Otherwise:

The original JASPAR paper:    
Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B.    

JASPAR: an open-access database for eukaryotic transcription factor binding profiles.
Nucleic Acids Res. 2004 Jan 1;32(Database issue):D91-4.

 

The first extension (JASPAR FAM and PHYLOFACTS collections):

Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B.

A new generation of JASPAR, the open-access repository for transcription factor binding site profiles.
Nucleic Acids Res. 2006 Jan 1;34(Database issue):D95-7. 

 

Second expansion (POLII, SPLICE, CNE, many changes in the web service including matrix permutations)

Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A.

JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update.

Nucleic Acids Res. 2008 Jan;36(Database issue):D102-6.

 

Third expansion (Large expansion of the CORE collection, including yeast and worm matrices. Also includes new PBM collections)

Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A.

JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles.

Nucleic Acids Res. 2010 Jan;38(Database issue):D105-10.

 

Fourth expansion

Mathelier, A., Zhao, X., Zhang, A. W., Parcy, F., Worsley-Hunt, R., Arenillas, D. J., Buchman, S., Chen, C.-y., Chou, A., Ienasescu, H., Lim, J., Shyr, C., Tan, G., Zhou, M., Lenhard, B., Sandelin, A. and Wasserman, W. W.

JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles.

Nucleic Acids Research, 2014 Jan;42(Database issue):D142-7.    

 

b.    Why are certain sequences not downloadable from JASPAR CORE?

This is due to historical reasons. JASPAR CORE was originally built in order to create familial binding profiles for as many structural classes of transcription factors as possible. In some experimental literature, only matrices and not sequences are available. For this project, we were forced to include some such matrices to gain coverage of certain binding site classes. For recent additions, it is a requirement to have the sequences available.


d.    Why is my matrix study not included in JASPAR CORE?

There are two principal explanations.

The most likely is that we were not aware of your work: please let us know!

The other possible reason is that the publication did not live up to the demands of the curators. As we have human curation of all JASPAR CORE matrices, this is to some degree an arbitrary call – we are happy to discuss it with you.

e.    Linking web services to CPU-intensive services within JASPAR

We appreciate that other services want to link to JASPAR. However, if you are using the CPU-intensive services (matrix comparison, randomization or clustering), please ask the maintainers (see contact information below) before you do this – otherwise your server might be rejected without warning. In that case, we strongly suggest setting up a local JASPAR database, as the database and resources are freely available.

f.     Who is JASPAR anyhow?

JASPAR was originally the name of a master's student project algorithm for comparing matrix profiles, an obscure tribute to an even more obscure dialog from the Black Adder episode “The Black Seal” between the Seven Most Evil Men in the Kingdom:

 

 - …and with all haste, we will meet at Old Jaspar’s tavern

  - How is old Jaspar these days?

  - Dead.

  - How?

  - I killed him.

 [Loud cheer].

 

g.    Where can I find "FlatFileDir/matrix_list.txt" files in JASPAR 2014 as provided in JASPAR 2010?

Visit this page!


7 Contact

We appreciate feedback – criticism as well as suggestions for new content. Development and supervision of the JASPAR project are coordinated by Albin Sandelin and Boris Lenhard.

Albin Sandelin albin AT binf.ku.dk

Boris Lenhard Boris.Lenhard AT bccs.uib.no