JASPAR TFBS extraction

About JASPAR TFBS extraction

The JASPAR TFBS extraction tool extracts JASPAR transcription factor binding sites (TFBSs) that intersect an input set of regions provided as a BED file. Optionally, the results of the intersection can be filtered by a list of transcription factors (TFs), a list of JASPAR matrix IDs, or a minimal TFBS score.

Results are printed to standard output by default. Optionally, users can provide an output path (-o) where the results will be saved instead.

Getting the JASPAR TFBS data

All JASPAR TFBSs can be found in our repository. There, TFBS collections are grouped by JASPAR release to make release-specific TFBS predictions easier to use. The bigBed files required to run this tool are found within each release-specific directory (e.g. JASPAR2022_hg38.bb for the hg38 TFBS predictions from the 2022 release).

Usage of the tool

The tool is run with the following command:

./bin/extract_TFBSs_JASPAR.sh -i INPUT BED -b INPUT BIGBED [-o OUTPUT] [-t TFs] [-m MATRIX IDs] [-s SCORE THRESHOLD] [-p NUM PROCESSORS]

where the following options are mandatory:

  • INPUT BED is a BED file containing the regions of interest. Note! Currently, compressed BED files such as .bed.gz are not supported as either input or output.
  • INPUT BIGBED is a bigBed file containing the JASPAR TFBSs. This file is expected to be obtained from our TFBS repository (see section Getting the JASPAR TFBS data).

And the following ones are optional:

  • OUTPUT is a path to the output file. When this is not provided, the extraction results are sent to standard output.
  • TFs is a file containing a list of TF gene symbols, one per line. When provided, only TFBSs for the specified TFs will be shown.
  • MATRIX IDs is a file containing a list of JASPAR matrix IDs, one per line. When provided, only TFBSs for the specified matrix IDs will be shown.
  • SCORE THRESHOLD is an integer giving the minimal score a TFBS must have to be reported. For more information about the correspondence between TFBS score and p-value, see this page.
  • NUM PROCESSORS is the number of cores to run in parallel (default = 2).
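
Since compressed BED files are not accepted and the TF list is a plain text file with one gene symbol per line, the inputs may need some preparation before the tool is run. The following is a minimal sketch of that preparation, not part of the tool itself. The file names (regions.bed.gz, my_TFs.txt, results.bed), the region coordinates, and the temporary working directory are hypothetical and only serve as illustration; the bigBed path is the one used in the example files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical working directory and file names, for illustration only.
workdir="$(mktemp -d)"

# Create a small gzipped BED file to stand in for real compressed input.
printf 'chr1\t50\t100\n' | gzip -c > "$workdir/regions.bed.gz"

# Compressed BED files are not supported, so decompress before running the tool.
gunzip -c "$workdir/regions.bed.gz" > "$workdir/regions.bed"

# The TF list is a plain text file with one gene symbol per line.
printf '%s\n' "Macho-1" > "$workdir/my_TFs.txt"

# Run the extraction only if the script is available from the current directory.
if [ -x ./bin/extract_TFBSs_JASPAR.sh ]; then
  ./bin/extract_TFBSs_JASPAR.sh \
    -i "$workdir/regions.bed" \
    -b example_files/JASPAR2022_ci3.bb \
    -t "$workdir/my_TFs.txt" \
    -o "$workdir/results.bed"
fi
```

The guard around the final call simply skips the extraction when the script is not found, so the preparation steps can be tried on their own.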

In the example_files directory of our Bitbucket repository we provide a set of example files to illustrate how the tool works. Running the following will return the Ciona intestinalis TFBSs that intersect the regions in example_files/ciona_regions.bed, restricted to the TFs listed in example_files/TFs.txt and to a minimal score of 300, and print the results to stdout:

./bin/extract_TFBSs_JASPAR.sh \
  -i example_files/ciona_regions.bed \
  -b example_files/JASPAR2022_ci3.bb \
  -t example_files/TFs.txt \
  -s 300

chr1    63  72  MA0118.1    306 -   Macho-1
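
The output line above appears to follow a BED-like layout: chromosome, start, end, JASPAR matrix ID, score, strand, and TF name. This column interpretation is inferred from the example rather than stated explicitly, and tab separation is assumed. Under those assumptions, results could be post-processed with standard tools. The sketch below counts TFBSs per TF with awk; the temporary results file is hypothetical and contains only the example line shown above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical results file holding the example output line shown above.
results="$(mktemp)"
printf 'chr1\t63\t72\tMA0118.1\t306\t-\tMacho-1\n' > "$results"

# Assumed columns: 1 chrom, 2 start, 3 end, 4 matrix ID, 5 score, 6 strand, 7 TF name.
# Count TFBSs per TF, keeping only rows whose score is at least 300.
summary="$(awk -F'\t' -v min=300 '
  $5 >= min { n[$7]++ }
  END { for (tf in n) printf "%s\t%d\n", tf, n[tf] }
' "$results")"

echo "$summary"
```

With the single example line as input, this prints one row giving Macho-1 and a count of 1.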

For more information, please refer to our Bitbucket repository.