Creating Maps

Creating a simple map using CCT

  1. To use CCT to create a simple map with no comparisons, first create a new analysis project. In the following example, a new project directory called my_project is created:

    cgview_comparison_tool.pl -p my_project

    For details on the CCT project directory structure see Creating a New CCT Project.

  2. Place the genome sequence you wish to analyze in the reference_genome directory, which is located in the newly created my_project directory. The sequence file can be in GenBank format (with a '.gbk' file extension) or FASTA format (with a '.fasta' extension). In this example the genome sequence should be placed in the my_project/reference_genome directory:

    cp $CCT_HOME/sample_projects/sample_project_5/reference_genome/NC_001823.gbk \
       my_project/reference_genome

  3. Run CCT again:

    cgview_comparison_tool.pl -p my_project

    The final step will create a .png map in the my_project/maps directory.

Note: to create a comparison map directly with cgview_comparison_tool.pl, the project_settings.conf file must be edited. See the section Editing the project_settings.conf file for an example. An easier method is to use the build_blast_atlas.sh script, as described next.

Creating a BLAST atlas using CCT

CCT can be used to build "BLAST atlases" which compare a reference genome of interest to one or more other genomes or sequence collections. To simplify the creation of BLAST atlases CCT includes a wrapper script called build_blast_atlas.sh. This script automatically creates maps for nucleotide (blastn) comparisons and translated coding sequence (blastp) comparisons. It also generates multiple maps for each comparison type, differing in terms of size and detail.

  1. Run the build_blast_atlas.sh script, passing it the file describing the reference genome:

    build_blast_atlas.sh -i $CCT_HOME/sample_projects/sample_project_1/reference_genome/NC_007719.gbk

    This will produce a project directory called NC_007719. For details on the directory structure see Creating a New BLAST Atlas Project.

    The configuration files project_settings_cds_vs_cds.conf and project_settings_dna_vs_dna.conf can be edited prior to completion of the map drawing process (see Customizing CCT maps).

  2. Place the genomes you want compared to the reference genome into the NC_007719/comparison_genomes directory. These files must end with a '.gbk' extension.

    cp $CCT_HOME/sample_projects/sample_project_1/comparison_genomes/*.gbk \
       NC_007719/comparison_genomes

  3. Begin the map drawing process by running the build_blast_atlas.sh script again, pointing at the project directory:

    build_blast_atlas.sh -p NC_007719

    The above command will generate several maps of different sizes showing nucleotide or protein-based BLAST comparisons. The resulting maps can be accessed within the following directories, once the entire process is complete:

    NC_007719/maps_for_dna_vs_dna
    NC_007719/maps_for_cds_vs_cds

    The DNA_vs_DNA maps show the results of blastn comparisons between the reference genome and each comparison genome, while the CDS_vs_CDS maps show the results of blastp comparisons between the CDS translations extracted from the GenBank files. In these maps color is used to indicate the percent identity of BLAST hits. The BLAST hit rings are sorted such that the most similar genomes are presented first (closest to the outside of the circle).
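
Because step 2 requires every comparison file to end with '.gbk', it can help to check the directory before starting the map drawing process. The following is a minimal sketch, not part of CCT: list_non_gbk is a hypothetical helper, and the temporary directory stands in for NC_007719/comparison_genomes so the example runs without a CCT installation.

```shell
# Hypothetical stand-in for NC_007719/comparison_genomes (assumption for illustration).
project=$(mktemp -d)
mkdir -p "$project/comparison_genomes"
touch "$project/comparison_genomes/genome_a.gbk" \
      "$project/comparison_genomes/genome_b.fasta"

# list_non_gbk: hypothetical helper that prints files whose names do not end in '.gbk'.
list_non_gbk() {
  find "$1" -type f ! -name '*.gbk'
}

bad=$(list_non_gbk "$project/comparison_genomes")
if [ -n "$bad" ]; then
  echo "Files without a .gbk extension:"
  echo "$bad"
fi
```

In a real project, the directory argument would be the comparison_genomes directory of the project created in step 1.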

Creating an all vs. all BLAST atlas using CCT

A wrapper script called build_blast_atlas_all_vs_all.sh is included with CCT. This script generates several CCT projects automatically, and then it combines the results into a single montage map. The montage consists of a separate map for each sequence of interest. This allows each sequence in a group of sequences to be visualized as the reference sequence.

  1. To generate a separate BLAST atlas for each genome in a collection of genomes, first create a new project:

    build_blast_atlas_all_vs_all.sh -p montage_project

    This will produce a project directory called montage_project. For details on the project directory structure see Creating a New BLAST Atlas All vs All Project.

    The configuration file project_settings_multi.conf can be edited prior to completion of the map drawing process (see Customizing CCT maps).

  2. Place the GenBank files for the genomes in the montage_project/comparison_genomes directory. The files must end with a '.gbk' extension. In this example we are fetching all the available Bordetella genomes:

    fetch_refseq_bacterial_genomes_by_name.sh \
      -n "Bordetella*" -o montage_project/comparison_genomes/

  3. Run the build_blast_atlas_all_vs_all.sh command again:

    build_blast_atlas_all_vs_all.sh -p montage_project

    This will start the map creation process. Within the montage_project directory a separate directory is created for each map (one for each sequence). After all the maps are created, a single montage of the maps, called montage.png, is generated in the montage_project directory.

    The maps created using the build_blast_atlas_all_vs_all.sh script show the results of blastn comparisons by default.

Editing the project_settings.conf file

To adjust which types of analyses are performed for your sequence, you can edit the project_settings.conf file in your project directory. Try the following:

  1. Create a new project:

    cgview_comparison_tool.pl -p my_project_2

  2. Place a genome sequence in my_project_2/reference_genome:

    cp \
    $CCT_HOME/sample_projects/sample_project_2/reference_genome/Methanobacterium_thermoautotrophicum.gbk \
    my_project_2/reference_genome

  3. In this example we will compare the reference genome to a second genome, by placing a genome sequence in the my_project_2/comparison_genomes directory:

    cp \
    $CCT_HOME/sample_projects/sample_project_2/comparison_genomes/Methanosarcina_acetivorans.gbk \
    my_project_2/comparison_genomes

  4. Edit the my_project_2/project_settings.conf file. This file controls how CCT processes your project. For example, to perform a BLAST using the reference genome coding regions as queries, find the section called 'BLAST query source settings' and make the following change.

    Change:

    query_source = none

    To:

    query_source = cds

    Similarly, to specify how the comparison genomes are used in the BLAST comparison, find the section called 'BLAST database source settings' and make the following change.

    Change:

    database_source = none

    To:

    database_source = trans

    The settings in this example tell CCT to extract the coding sequence translations from the reference genome GenBank file, and BLAST them against the 6-frame translation of the genome sequences in the project's comparison_genomes directory.

  5. You can also control how the results are presented using the 'Graphical map settings' section. For example, to draw feature labels, make the following change:

    Change:

    draw_feature_labels = F

    To:

    draw_feature_labels = T

    To draw a larger map (this will allow the feature labels to fit on the canvas), make this change:

    Change:

    map_size = medium

    To:

    map_size = large

    The various other settings are described in the project_settings.conf file.

  6. Now that you have edited project_settings.conf, run CCT:

    cgview_comparison_tool.pl -p my_project_2

    This command will perform a BLAST analysis and create a map in my_project_2/maps. Whenever you make changes to the project_settings.conf file you can update the map using this command.
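
The edits in steps 4 and 5 can also be applied from the command line. The sketch below is one possible approach, not a CCT feature. It assumes each setting sits on its own line in the form 'name = value', as shown in the steps above, and it works on a small stand-in file in a temporary directory so that it runs without a CCT installation.

```shell
# Hypothetical stand-in for my_project_2/project_settings.conf (assumption:
# each setting is written as "name = value" at the start of a line).
conf_dir=$(mktemp -d)
conf="$conf_dir/project_settings.conf"
printf '%s\n' \
  'query_source = none' \
  'database_source = none' \
  'draw_feature_labels = F' \
  'map_size = medium' > "$conf"

# Apply the four changes from steps 4 and 5; -i.bak keeps a backup copy
# of the original file next to the edited one.
sed -i.bak \
  -e 's/^query_source = .*/query_source = cds/' \
  -e 's/^database_source = .*/database_source = trans/' \
  -e 's/^draw_feature_labels = .*/draw_feature_labels = T/' \
  -e 's/^map_size = .*/map_size = large/' \
  "$conf"

cat "$conf"
```

For a real project, point the sed command at my_project_2/project_settings.conf and then rerun cgview_comparison_tool.pl -p my_project_2 as in step 6.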

project_settings.conf file options

Attribute Value Description
minimum_orf_length Integer The minimum ORF length used (in codons) when ORFs are extracted from genomic sequences.
genetic_code Integer The genetic code to use for translated BLAST searches and for ORF translation. The default is the bacterial genetic code (genetic code 11). See https://bioinformatics.org/sms2/genetic_code.html for descriptions of the different genetic codes.
start_codons Codons separated by '|' The start codons to be used when finding ORFs. The default set (ttg|ctg|att|atc|ata|atg|gtg) contains the starts for bacterial sequences.
stop_codons Codons separated by '|' The stop codons to use when finding ORFs.
query_size Integer The query size for BLAST searches, i.e. how much of the reference genome is used in each BLAST search. This setting only applies to 'trans' and 'nucleotide' comparisons (see the query_source option below).
expect Real The BLAST expect value to use.
score Integer The minimum score required for BLAST hits.
hits Integer The number of BLAST hits to keep for each query.
minimum_hit_proportion Real The minimum acceptable hit length for BLAST results, expressed as a proportion of the length of the query.
query_source nucleotide / trans / cds / orfs / none The source of the BLAST query sequences. These sequences are extracted from the reference genome sequence, located in the reference_genome directory. Details on the different types can be found in the Blast Comparisons section.
database_source nucleotide / dna / trans / cds / orfs / proteins / none The sources of the BLAST databases. The databases are built using the sequences in the comparison_genomes directory. Details on the different types can be found in the Blast Comparisons section.
cog_source orfs / cds / none The proteins from the reference sequence to be assigned COG functional categories. Three options are available:
  • orfs - translated ORFs from a GenBank (.gbk), FASTA (.fasta), or RAW (.raw) file.
  • cds - CDS protein sequences extracted from a GenBank (.gbk) file.
  • none - do not assign COG functional categories.
cog_top_hit T / F Whether to use only the top BLAST hit for COG functional assignment.
T / F
draw_divider T / F Whether a divider should be drawn between the start and end of the sequence to indicate that the sequence is linear.
draw_orfs T / F Whether open reading frames (ORFs) in the reference genome should be drawn.
draw_gc_skew T / F Whether GC skew in the reference genome should be drawn.
draw_legend T / F Whether a feature legend should be drawn.
draw_feature_labels T / F Whether features should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size).
draw_hit_labels T / F Whether BLAST hits should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size).
draw_orf_labels T / F Whether ORFs should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size).
draw_condensed T / F Whether thin feature rings should be used. This option is useful for maps that are to be used for analysis purposes rather than as a figure for publication.
draw_divider_rings T / F Whether divider rings should be drawn between feature rings.
draw_hits_by_reading_frame T / F Whether each set of BLAST results should be divided into six slots, based on the reading frame and strand of the query gene or ORF that produced the hit. This option only applies to comparisons done when the 'query_source' option is set to 'orfs' or 'cds'.
use_opacity T / F Whether BLAST hits should be drawn with partial opacity so that overlapping hits can be seen.
scale_blast T / F Whether BLAST hits should be drawn with height proportional to percent identity of hit.
gene_decoration arc / arrow Whether genes should be drawn as an arc or as an arrow.
highlight_query T / F Whether the position of the queries should be faintly highlighted on the map. By showing the query positions it is easier to see if a hit was obtained for specific ORFs or features.
map_size small / medium / large / x-large or combination separated by commas The size of the maps to draw. Multiple options can be separated by commas (e.g. small,large).
  • small - 1000 x 1000
  • medium - 3000 x 3000
  • large - 9000 x 9000
  • x-large - 12000 x 12000
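
As a quick illustration of the comma-separated form of map_size, the sketch below splits a value and prints the dimensions listed above for each entry. describe_map_sizes is a hypothetical helper written for this example. It is not part of CCT, and it only knows the four sizes in the list.

```shell
# describe_map_sizes: hypothetical helper (not a CCT command) that prints the
# pixel dimensions for each entry of a comma-separated map_size value.
describe_map_sizes() {
  old_ifs=$IFS
  IFS=,
  # $1 is left unquoted on purpose so the shell splits it on commas.
  for size in $1; do
    case "$size" in
      small)   echo "small: 1000 x 1000" ;;
      medium)  echo "medium: 3000 x 3000" ;;
      large)   echo "large: 9000 x 9000" ;;
      x-large) echo "x-large: 12000 x 12000" ;;
      *)       echo "$size: not in the list above" ;;
    esac
  done
  IFS=$old_ifs
}

describe_map_sizes "small,large"
```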

BLAST comparison types and the 'query_source' and 'database_source' settings

There are several types of BLAST comparisons that can be performed by CCT. The table below shows the compatible values for 'query_source' and 'database_source', lists the required reference and comparison sequence file types and file extensions, and describes the comparisons that are performed. Note that multiple comma-separated values can be given for 'query_source' and 'database_source'--CCT will perform all the compatible comparisons. Many different files can be included in the 'comparison_genomes' directory. CCT examines file extensions when deciding which files to include in each BLAST comparison. When there are multiple files with the same extension, a separate BLAST comparison is conducted for each, and the results are shown in separate rings on the resulting map.

query_source | database_source | reference_genome file types (extensions) | comparison_genomes file types (extensions) | BLAST comparison, in the form 'reference vs comparison (BLAST type)'
nucleotide | nucleotide | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | DNA vs DNA (blastn)
nucleotide | dna | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more DNA sequences in FASTA format (.fna) | DNA vs DNA sequences (blastn)
trans | trans | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | 6-frame translated DNA vs 6-frame translated DNA (tblastx)
trans | cds | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk) | 6-frame translated DNA vs CDS protein sequences extracted from GenBank files (blastx)
trans | orfs | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | 6-frame translated reference DNA vs translated ORFs (blastx)
trans | proteins | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more protein sequences in FASTA format (.faa) | 6-frame translated reference DNA vs protein sequences (blastx)
trans | dna | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more DNA sequences in FASTA format (.fna) | 6-frame translated reference DNA vs 6-frame translated DNA sequences (tblastx)
cds | trans | GenBank (.gbk) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | CDS protein sequences extracted from GenBank file vs 6-frame translated DNA (tblastn)
cds | cds | GenBank (.gbk) | GenBank (.gbk) | CDS protein sequences extracted from GenBank file vs CDS protein sequences extracted from GenBank files (blastp)
cds | orfs | GenBank (.gbk) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | CDS protein sequences extracted from GenBank file vs translated ORFs (blastp)
cds | proteins | GenBank (.gbk) | One or more protein sequences in FASTA format (.faa) | CDS protein sequences extracted from GenBank file vs protein sequences (blastp)
cds | dna | GenBank (.gbk) | One or more DNA sequences in FASTA format (.fna) | CDS protein sequences extracted from GenBank file vs 6-frame translated DNA sequences (tblastn)
orfs | trans | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | Translated ORFs vs 6-frame translated DNA (tblastn)
orfs | cds | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk) | Translated ORFs vs CDS protein sequences extracted from GenBank files (blastp)
orfs | orfs | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | Translated ORFs vs translated ORFs (blastp)
orfs | proteins | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more protein sequences in FASTA format (.faa) | Translated ORFs vs protein sequences (blastp)
orfs | dna | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more DNA sequences in FASTA format (.fna) | Translated ORFs vs 6-frame translated DNA sequences (tblastn)
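
The pairings in the table above can be summarized as a lookup from the two settings to the BLAST program that is run. The sketch below is only a reading aid: blast_program_for is a hypothetical helper, not a CCT command, and it reports any pair that is not in the table as "not listed" rather than guessing how CCT handles it.

```shell
# blast_program_for QUERY_SOURCE DATABASE_SOURCE
# Hypothetical helper (not part of CCT) that mirrors the table above.
blast_program_for() {
  case "$1:$2" in
    nucleotide:nucleotide|nucleotide:dna)
      echo blastn ;;
    trans:trans|trans:dna)
      echo tblastx ;;
    trans:cds|trans:orfs|trans:proteins)
      echo blastx ;;
    cds:trans|cds:dna|orfs:trans|orfs:dna)
      echo tblastn ;;
    cds:cds|cds:orfs|cds:proteins|orfs:cds|orfs:orfs|orfs:proteins)
      echo blastp ;;
    *)
      echo "not listed"
      return 1 ;;
  esac
}

blast_program_for cds trans                 # the pairing used in the project_settings.conf example
blast_program_for nucleotide cds || true    # a pair that does not appear in the table
```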

Customizing CCT maps

There are many ways to modify the contents and appearance of CCT maps. See the Tutorials section for examples. The general approaches are described below.

Labelling a subset of genes (labels_to_show.txt)

To label a subset of genes in the reference genome, place a file called labels_to_show.txt in the project directory. This file should be a tab-delimited or comma-delimited text file specifying which genes should be labeled. Each row must consist of a gene identifier followed by the text that is to be used for the label. When using a GenBank or EMBL file as the reference genome the gene identifier should match the value of the '/gene' qualifier (or the value of the '/locus_tag' qualifier if there isn't a '/gene' qualifier given for a particular gene). When describing genes using the 'features' directory and .gff files, the gene identifier should match the 'seqname' value. Note that providing a labels_to_show.txt file will cause the 'draw_feature_labels' setting in the 'project_settings.conf' file to be ignored.
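
A tab-delimited labels_to_show.txt file could be written as in the sketch below. The gene identifiers and label text are made up for illustration, and the temporary directory stands in for the project directory. In a real project the identifiers must match the '/gene' or '/locus_tag' values in the reference genome.

```shell
# Hypothetical project directory and gene names, for illustration only.
project=$(mktemp -d)

# One gene per row: identifier, a tab, then the text to use as the label.
# printf reuses the format string for each pair of arguments.
printf '%s\t%s\n' \
  'dnaA' 'dnaA (replication initiator)' \
  'gyrB' 'gyrB' > "$project/labels_to_show.txt"

cat "$project/labels_to_show.txt"
```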

Adding additional features (feature GFF files)

To add features to the map, place one or more files with a '.gff' extension in the features directory, which is located in the CCT project directory. The files should be tab-delimited or comma-delimited and should have the following column titles, in the following order: 'seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame'. The first line in the file must be the column titles. For a given entry, 'seqname' should be the name of the gene, 'feature' should be the type of gene (CDS, rRNA, tRNA, other) or the single letter COG category (J for example). 'start' and 'end' should be integers between 1 and the length of the sequence, and the 'start' value should be less than or equal to the 'end' regardless of the 'strand' value. The 'strand' value should be '+' for the forward strand and '-' for the reverse strand. All other values can be given as '.' or left blank, since they are ignored. These column titles are based on the specification of the GFF file format. If 'start' and 'end' values are not supplied, but a 'seqname' is given, this script will attempt to get the 'start' and 'end' values from the sequence file.
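
The sketch below writes a small tab-delimited feature file with the column titles described above and then checks the 'start', 'end', and 'strand' rules with awk. The file name my_features.gff, the gene names, and the coordinates are all hypothetical, and the temporary directory stands in for a CCT project directory.

```shell
# Hypothetical project directory, file name, and entries (for illustration only).
project=$(mktemp -d)
mkdir -p "$project/features"
gff="$project/features/my_features.gff"

# First line: the eight column titles, in the required order.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  seqname source feature start end score strand frame > "$gff"

# Entries: ignored columns are given as '.'; the last row uses a COG letter as the feature.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  geneA . CDS  100  1200 . + . \
  geneB . tRNA 1500 1575 . - . \
  geneC . J    2000 2900 . + . >> "$gff"

# Check that start <= end and that strand is '+' or '-' on every data row.
awk -F'\t' '
  NR > 1 && ($4 > $5 || ($7 != "+" && $7 != "-")) { bad++; print "problem on line " NR }
  END { exit (bad > 0) }
' "$gff" && echo "rows look consistent"
```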

Adding additional analysis results (analysis GFF files)

To add analysis results to the map, place one or more files with a '.gff' extension in the analysis directory, which is located in the CCT project directory. The files should be tab-delimited or comma-delimited and should have the following column titles, in the following order: 'seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame'. The first line in the file must be the column titles. For a given entry, only the 'start', 'end', 'strand', and 'score' values are required. 'start' and 'end' should be integers between 1 and the length of the sequence, and the 'start' value should be less than or equal to the 'end' regardless of the 'strand' value. The 'strand' value should be '+' for the forward strand and '-' for the reverse strand. The 'score' value should be a real number, positive or negative. The other values can be given as '.' or left blank. These column titles are based on the specification of the GFF file format. If 'start' and 'end' values are not supplied, but a 'seqname' is given, this script will attempt to get the 'start' and 'end' values from the sequence file.
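
One way to produce an analysis file is to convert an existing table of scored regions. The sketch below assumes a hypothetical three-column input (start, end, score) and a hypothetical output name my_scores.gff. It fills the columns that are not required with '.' and places every region on the forward strand.

```shell
# Hypothetical project directory and input values (for illustration only).
project=$(mktemp -d)
mkdir -p "$project/analysis"

# Input table: start, end, score (tab-delimited).
printf '%s\t%s\t%s\n' \
  1    1000 0.52 \
  1001 2000 -0.13 \
  2001 3000 1.75 > "$project/scores.tsv"

{
  # Column titles must be the first line of the file.
  printf 'seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\n'
  # Only start, end, score, and strand are required; the rest are '.'.
  awk -F'\t' -v OFS='\t' '{ print ".", ".", ".", $1, $2, $3, "+", "." }' \
    "$project/scores.tsv"
} > "$project/analysis/my_scores.gff"

cat "$project/analysis/my_scores.gff"
```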

COG categories and colours

Category Colour Description
Information storage and processing [oranges/reds]
A Red RNA processing and modification
B Tomato Chromatin structure and dynamics
J Light coral Translation, ribosomal structure and biogenesis
K Dark orange Transcription
L Deep pink Replication, recombination and repair
Cellular processes and signaling [greens/yellows]
D Khaki Cell cycle control, cell division, chromosome partitioning
O Dark khaki Post-translational modification, protein turnover, and chaperones
M Olive drab Cell wall/membrane/envelope biogenesis
N Forest green Cell motility
P Yellow green Inorganic ion transport and metabolism
T Lime green Signal transduction mechanisms
U Green yellow Intracellular trafficking, secretion, and vesicular transport
V Medium spring green Defense mechanisms
W Dark sea green Extracellular structures (this doesn't appear in reference database)
Y Medium sea green Nuclear structure (this appears once in reference database)
Z Yellow Cytoskeleton
Metabolism [blues/purples]
C Cyan Energy production and conversion
G Dark turquoise Carbohydrate transport and metabolism
E Steel blue Amino acid transport and metabolism
F Deep sky blue Nucleotide transport and metabolism
H Blue Coenzyme transport and metabolism
I Slate blue Lipid transport and metabolism
Q Navy Secondary metabolites biosynthesis, transport, and catabolism
Poorly characterized [grays]
R Gray General function prediction only (examples include "Predicted thioesterase", "Predicted ATPase")
S Dark gray Function unknown (examples include "Uncharacterized conserved protein", "Predicted small secreted protein")
Unknown White Not assigned COG letter because protein is not similar to any COG