Creating Maps

Contents

Creating a simple map using CCT
Creating a BLAST atlas using CCT
Creating an all vs. all BLAST atlas using CCT
Editing the project_settings.conf file
project_settings.conf file options
BLAST comparison types and the 'query_source' and 'database_source' settings
Customizing CCT maps
Labelling a subset of genes
Adding additional features
Adding additional analysis results
COG categories and colours

Creating a simple map using CCT

To use CCT to create a simple map with no comparisons, first create a new analysis project. In the following example, a new project directory called my_project is created:
```
cgview_comparison_tool.pl -p my_project
```
For details on the CCT project directory structure see Creating a New CCT Project.
Place the genome sequence you wish to analyze in the reference_genome directory, which is located in the newly created my_project directory. The sequence file can be in GenBank format (with a '.gbk' file extension) or FASTA format (with a '.fasta' extension). In this example the genome sequence should be placed in the my_project/reference_genome directory:
```
cp $CCT_HOME/sample_projects/sample_project_5/reference_genome/NC_001823.gbk \
   my_project/reference_genome
```
Run CCT again:
```
cgview_comparison_tool.pl -p my_project
```
The final step will create a .png map in the my_project/maps directory.
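The copy step above can be guarded with a small check on the file extension. The following is a minimal sketch, not part of CCT: the function names `has_supported_extension` and `add_reference_genome` are hypothetical helpers, and the check only covers the two extensions named in this section.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with CCT).

# Succeeds only for the extensions named in this section: .gbk and .fasta.
has_supported_extension() {
  case "$1" in
    *.gbk|*.fasta) return 0 ;;
    *) return 1 ;;
  esac
}

# Copy a genome file into <project>/reference_genome, refusing other extensions.
# Assumes the project directory has already been created by CCT.
add_reference_genome() {
  project_dir=$1
  genome_file=$2
  if ! has_supported_extension "$genome_file"; then
    echo "Unsupported extension: $genome_file" >&2
    return 1
  fi
  cp "$genome_file" "$project_dir/reference_genome/"
}

# Example (after creating the project):
#   add_reference_genome my_project NC_001823.gbk
#   cgview_comparison_tool.pl -p my_project
```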

Note: to create a comparison map directly with cgview_comparison_tool.pl, the project_settings.conf file has to be edited. See the section Editing the project_settings.conf file for an example. An easier method is to use the build_blast_atlas.sh script, as described next.

Creating a BLAST atlas using CCT

CCT can be used to build "BLAST atlases" which compare a reference genome of interest to one or more other genomes or sequence collections. To simplify the creation of BLAST atlases CCT includes a wrapper script called build_blast_atlas.sh. This script automatically creates maps for nucleotide (blastn) comparisons and translated coding sequence (blastp) comparisons. It also generates multiple maps for each comparison type, differing in terms of size and detail.

Run the build_blast_atlas.sh script, passing it the file describing the reference genome:
```
build_blast_atlas.sh -i $CCT_HOME/sample_projects/sample_project_1/reference_genome/NC_007719.gbk
```
This will produce a project directory called NC_007719. For details on the directory structure see Creating a New BLAST Atlas Project.

The configuration files project_settings_cds_vs_cds.conf and project_settings_dna_vs_dna.conf can be edited prior to completion of the map drawing process (see Customizing CCT maps).
Place the genomes you want compared to the reference genome into the NC_007719/comparison_genomes directory. These files must end with a '.gbk' extension.
```
cp $CCT_HOME/sample_projects/sample_project_1/comparison_genomes/*.gbk \
   NC_007719/comparison_genomes
```
Begin the map drawing process by running the build_blast_atlas.sh script again, pointing at the project directory:
```
build_blast_atlas.sh -p NC_007719
```
The above command will generate several maps of different sizes showing nucleotide or protein-based BLAST comparisons. The resulting maps can be accessed within the following directories, once the entire process is complete:
```
NC_007719/maps_for_dna_vs_dna
NC_007719/maps_for_cds_vs_cds
```
The DNA_vs_DNA maps show the results of blastn comparisons between the reference genome and each comparison genome, while the CDS_vs_CDS maps show the results of blastp comparisons between the CDS translations extracted from the GenBank files. In these maps, color is used to indicate the percent identity of BLAST hits. The BLAST hit rings are sorted such that the most similar genomes are presented first (closest to the outside of the circle).
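Because the comparison files must end with '.gbk', it can help to list any files that do not before starting the map drawing step. This is a minimal sketch under that assumption; `list_non_gbk` is a hypothetical helper, not a CCT command.

```shell
#!/bin/sh
# Hypothetical helper: print every entry in a directory that does not end in .gbk.
list_non_gbk() {
  for path in "$1"/*; do
    [ -e "$path" ] || continue   # skip the unexpanded pattern when the directory is empty
    case "$path" in
      *.gbk) ;;                          # expected extension, nothing to report
      *) printf '%s\n' "$path" ;;
    esac
  done
}

# Example:
#   list_non_gbk NC_007719/comparison_genomes
```

An empty result means every file in the directory carries the expected extension.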

Creating an all vs. all BLAST atlas using CCT

A wrapper script called build_blast_atlas_all_vs_all.sh is included with CCT. This script generates several CCT projects automatically, and then it combines the results into a single montage map. The montage consists of a separate map for each sequence of interest. This allows each sequence in a group of sequences to be visualized as the reference sequence.

To generate a separate BLAST atlas for each genome in a collection of genomes, first create a new project:
```
build_blast_atlas_all_vs_all.sh -p montage_project
```
This will produce a project directory called montage_project. For details on the project directory structure see Creating a New BLAST Atlas All vs All Project.

The configuration file project_settings_multi.conf can be edited prior to completion of the map drawing process (see Customizing CCT maps).
Place the GenBank files for the genomes in the montage_project/comparison_genomes directory. The files must end with a '.gbk' extension. In this example we are fetching all the available Bordetella genomes:
```
fetch_refseq_bacterial_genomes_by_name.sh \
  -n "Bordetella*" -o montage_project/comparison_genomes/
```
Run the build_blast_atlas_all_vs_all.sh command again:
```
build_blast_atlas_all_vs_all.sh -p montage_project
```
This will start the map creation process. Within the montage_project directory, a separate directory is created for each map (one for each sequence). After all the maps are created, they are combined into a single montage called montage.png in the montage_project directory.

The maps created using the build_blast_atlas_all_vs_all.sh script show the results of blastn comparisons by default.

Editing the project_settings.conf file

To adjust which types of analyses are performed for your sequence, you can edit the project_settings.conf file in your project directory. Try the following:

Create a new project:

```
cgview_comparison_tool.pl -p my_project_2
```

Place a genome sequence in my_project_2/reference_genome:

```
cp \
$CCT_HOME/sample_projects/sample_project_2/reference_genome/Methanobacterium_thermoautotrophicum.gbk \
my_project_2/reference_genome
```

In this example we will compare the reference genome to a second genome, by placing a genome sequence in the my_project_2/comparison_genomes directory:
```
cp \
$CCT_HOME/sample_projects/sample_project_2/comparison_genomes/Methanosarcina_acetivorans.gbk \
my_project_2/comparison_genomes
```
Edit the my_project_2/project_settings.conf file. This file controls how CCT processes your project. For example, to perform a BLAST using the reference genome coding regions as queries, find the section called 'BLAST query source settings' and make the following change:
Change:
```
query_source = none
```
→
To:
```
query_source = cds
```
Similarly, to specify how the comparison genomes are used in the BLAST comparison, find the section called 'BLAST database source settings' and make the following change:
Change:
```
database_source = none
```
→
To:
```
database_source = trans
```
The settings in this example tell CCT to extract the coding sequence translations from the reference genome GenBank file, and BLAST them against the 6-frame translation of the genome sequences in the project's comparison_genomes directory.
You can also control how the results are presented using the 'Graphical map settings' section. For example, to draw feature labels, make the following change:
Change:
```
draw_feature_labels = F
```
→
To:
```
draw_feature_labels = T
```
To draw a larger map (this will allow the feature labels to fit on the canvas), make this change:
Change:
```
map_size = medium
```
→
To:
```
map_size = large
```
The various other settings are described in the project_settings.conf file.
Now that you have edited project_settings.conf, run CCT:
```
cgview_comparison_tool.pl -p my_project_2
```
This command will perform a BLAST analysis and create a map in my_project_2/maps. Whenever you make changes to the project_settings.conf file you can update the map using this command.
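The manual edits above can also be scripted. The sketch below assumes that settings appear one per line in the `key = value` form shown in the examples; `set_conf_value` is a hypothetical helper and not part of CCT.

```shell
#!/bin/sh
# Hypothetical helper: set_conf_value <conf_file> <key> <value>
# Rewrites the line that starts with "<key> =" and leaves all other lines untouched.
# Assumes one "key = value" setting per line, as in the examples above.
set_conf_value() {
  awk -v key="$2" -v value="$3" '
    { line = $0; sub(/^[ \t]+/, "", line) }      # ignore leading whitespace
    index(line, key) == 1 && substr(line, length(key) + 1) ~ /^[ \t]*=/ {
      print key " = " value                      # replace the matching setting
      next
    }
    { print }                                    # copy every other line as-is
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example, mirroring the edits described in this section:
#   set_conf_value my_project_2/project_settings.conf query_source cds
#   set_conf_value my_project_2/project_settings.conf database_source trans
#   cgview_comparison_tool.pl -p my_project_2
```

Matching on the whole key followed by '=' avoids touching commented lines or settings whose names merely begin with the same text.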

project_settings.conf file options

Attribute	Value	Description
minimum_orf_length	Integer	The minimum ORF length used (in codons) when ORFs are extracted from genomic sequences.
genetic_code	Integer	The genetic code to use for translated BLAST searches and for ORF translation. The default is the bacterial genetic code (genetic code 11). See https://bioinformatics.org/sms2/genetic_code.html for descriptions of the different genetic codes.
start_codons	Codons separated by '\|'	The start codons to be used when finding ORFs. The default set (ttg\|ctg\|att\|atc\|ata\|atg\|gtg) contains the starts for bacterial sequences.
stop_codons	Codons separated by '\|'	The stop codons to use when finding ORFs.
query_size	Integer	The query size for BLAST searches, i.e. how much of the reference genome is used in each BLAST search. This setting only applies to 'trans' and 'nucleotide' comparisons (see the query_source option below).
expect	Real	The BLAST expect value to use.
score	Integer	The minimum score required for BLAST hits.
hits	Integer	The number of BLAST hits to keep for each query.
minimum_hit_proportion	Real	The minimum acceptable hit length for BLAST results, expressed as a proportion of the length of the query.
query_source	nucleotide / trans / cds / orfs / none	The source of the BLAST query sequences. These sequences are extracted from the reference genome sequence, located in the reference_genome directory. Details on the different types can be found in the Blast Comparisons section.
database_source	nucleotide / dna / trans / cds / orfs / proteins / none	The sources of the BLAST databases. The databases are built using the sequences in the comparison_genomes directory. Details on the different types can be found in the Blast Comparisons section.
cog_source	orfs / cds / none	The proteins from the reference sequence to be assigned COG functional categories. Three options are available: orfs: translated ORFs from a GenBank (.gbk), FASTA (.fasta), or RAW (.raw) file; cds: CDS protein sequences extracted from a GenBank (.gbk) file; none: do not assign COG functional categories.
cog_top_hit	T / F	Whether to use only the top BLAST hit for COG functional assignment.
	T / F
draw_divider	T / F	Whether a divider should be drawn between the start and end of the sequence to indicate that the sequence is linear.
draw_orfs	T / F	Whether open reading frames (ORFs) in the reference genome should be drawn.
draw_gc_skew	T / F	Whether GC skew in the reference genome should be drawn.
draw_legend	T / F	Whether a feature legend should be drawn.
draw_feature_labels	T / F	Whether features should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size).
draw_hit_labels	T / F	Whether BLAST hits should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size).
draw_orf_labels	T / F	Whether ORFs should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size).
draw_condensed	T / F	Whether thin feature rings should be used. This option is useful for maps that are to be used for analysis purposes rather than as a figure for publication.
draw_divider_rings	T / F	Whether divider rings should be drawn between feature rings.
draw_hits_by_reading_frame	T / F	Whether each set of BLAST results should be divided into six slots, based on the reading frame and strand of the query gene or ORF that produced the hit. This option only applies to comparisons done when the 'query_source' option is set to 'orfs' or 'cds'.
use_opacity	T / F	Whether BLAST hits should be drawn with partial opacity so that overlapping hits can be seen.
scale_blast	T / F	Whether BLAST hits should be drawn with height proportional to percent identity of hit.
gene_decoration	arc / arrow	Whether genes should be drawn as an arc or as an arrow.
highlight_query	T / F	Whether the position of the queries should be faintly highlighted on the map. By showing the query positions it is easier to see if a hit was obtained for specific ORFs or features.
map_size	small / medium / large / x-large or combination separated by commas	The size of the maps to draw. Multiple options can be separated by commas (e.g. small,large). small: 1000 x 1000; medium: 3000 x 3000; large: 9000 x 9000; x-large: 12000 x 12000.
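Since map_size accepts a comma-separated combination, a value can be checked before it is written to the settings file. This is a minimal sketch; `valid_map_size` is a hypothetical helper, and it accepts only the four sizes listed in the map_size row (it does not cover 'navigable', which is mentioned in other rows).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every comma-separated item is one of
# small, medium, large, x-large (the sizes listed in the map_size row).
valid_map_size() {
  remaining=$1
  [ -n "$remaining" ] || return 1
  while :; do
    size=${remaining%%,*}                 # text before the first comma
    case "$size" in
      small|medium|large|x-large) ;;
      *) return 1 ;;
    esac
    case "$remaining" in
      *,*) remaining=${remaining#*,} ;;   # drop the item just checked
      *) return 0 ;;                      # last item was valid
    esac
  done
}

# Example:
#   valid_map_size "small,large" && echo "ok"
```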

BLAST comparison types and the 'query_source' and 'database_source' settings

There are several types of BLAST comparisons that can be performed by CCT. The table below shows the compatible values for 'query_source' and 'database_source', lists the required reference and comparison sequence file types and file extensions, and describes the comparisons that are performed. Note that multiple comma-separated values can be given for 'query_source' and 'database_source'; CCT will perform all the compatible comparisons. Many different files can be included in the 'comparison_genomes' directory. CCT examines file extensions when deciding which files to include in each BLAST comparison. When there are multiple files with the same extension, a separate BLAST comparison is conducted for each, and the results are shown in separate rings on the resulting map.

query_source value	database_source value	reference_genome file types and file extensions required	comparison_genomes file types and file extensions required	description of BLAST comparison in the form 'reference vs comparison (BLAST type)'
nucleotide	nucleotide	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	DNA vs DNA (blastn)
nucleotide	dna	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	One or more DNA sequences in FASTA format (.fna)	DNA vs DNA sequences (blastn)
trans	trans	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	6-frame translated DNA vs 6-frame translated DNA (tblastx)
trans	cds	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk)	6-frame translated DNA vs CDS protein sequences extracted from GenBank files (blastx)
trans	orfs	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	6-frame translated reference DNA vs translated ORFs (blastx)
trans	proteins	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	One or more protein sequences in FASTA format (.faa)	6-frame translated reference DNA vs protein sequences (blastx)
trans	dna	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	One or more DNA sequences in FASTA format (.fna)	6-frame translated reference DNA vs 6-frame translated DNA sequences (tblastx)
cds	trans	GenBank (.gbk)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	CDS protein sequences extracted from GenBank file vs 6-frame translated DNA (tblastn)
cds	cds	GenBank (.gbk)	GenBank (.gbk)	CDS protein sequences extracted from GenBank file vs CDS protein sequences extracted from GenBank files (blastp)
cds	orfs	GenBank (.gbk)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	CDS protein sequences extracted from GenBank file vs translated ORFs (blastp)
cds	proteins	GenBank (.gbk)	One or more protein sequences in FASTA format (.faa)	CDS protein sequences extracted from GenBank file vs protein sequences (blastp)
cds	dna	GenBank (.gbk)	One or more DNA sequences in FASTA format (.fna)	CDS protein sequences extracted from GenBank file vs 6-frame translated DNA sequences (tblastn)
orfs	trans	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	Translated ORFs vs 6-frame translated DNA (tblastn)
orfs	cds	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk)	Translated ORFs vs CDS protein sequences extracted from GenBank files (blastp)
orfs	orfs	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	Translated ORFs vs translated ORFs (blastp)
orfs	proteins	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	One or more protein sequences in FASTA format (.faa)	Translated ORFs vs protein sequences (blastp)
orfs	dna	GenBank (.gbk), FASTA (.fasta), RAW (.raw)	One or more DNA sequences in FASTA format (.fna)	Translated ORFs vs 6-frame translated DNA sequences (tblastn)
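The table above can be condensed into a lookup from a 'query_source' and 'database_source' pair to the BLAST program it implies. The sketch below simply transcribes the table rows; `blast_program_for` is a hypothetical helper for illustration, not something CCT provides.

```shell
#!/bin/sh
# Hypothetical helper: blast_program_for <query_source> <database_source>
# Prints the BLAST program listed in the table for a compatible pair and
# fails (exit status 1) for any pair that does not appear in the table.
blast_program_for() {
  case "$1/$2" in
    nucleotide/nucleotide|nucleotide/dna) echo blastn ;;
    trans/trans|trans/dna) echo tblastx ;;
    trans/cds|trans/orfs|trans/proteins) echo blastx ;;
    cds/trans|cds/dna|orfs/trans|orfs/dna) echo tblastn ;;
    cds/cds|cds/orfs|cds/proteins|orfs/cds|orfs/orfs|orfs/proteins) echo blastp ;;
    *) return 1 ;;
  esac
}

# Example: the settings used in "Editing the project_settings.conf file"
#   blast_program_for cds trans
```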

Customizing CCT maps

There are many ways to modify the contents and appearance of CCT maps. See the Tutorials section for examples. The general approaches are described below.

Edit the project_settings.conf file or files that are created in the project directory. These files are used to control the types of BLAST searches that are performed, the types of graphs that are displayed, and a variety of other map characteristics.
Use the '--cct' option with the cgview_comparison_tool.pl script. This causes the BLAST results rings to take up more space, and causes them to be coloured according to the percent identity of each hit. This option is always used when the build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh wrapper scripts are used.
Use the '-t' option with the cgview_comparison_tool.pl script. This causes the BLAST results rings to be ordered so that the ones containing more hits of high percent identity are drawn closest to the outer edge of the figure. This option is always used when the build_blast_atlas.sh and 'build_blast_atlas_all_vs_all.sh' wrapper scripts are used.
Use the '-b' option to control the number of BLAST rings shown. This option works with the cgview_comparison_tool.pl script and with the wrapper scripts build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh. The default value for this option is 100, which means that up to 100 BLAST results rings will be shown. When the map is created using more than 100 comparison genomes or sequence collections, the top 100 rings (i.e. those producing the most high-identity hits) are shown.
After the maps have been drawn, edit the CGView XML files (in maps/cgview_xml) and then redraw the maps using the redraw_maps.sh script:
```
redraw_maps.sh -p my_project
```
In the example above, the redraw_maps.sh script is used to draw maps from the CGView XML files located in the my_project CCT project. More information on the CGView XML format is available on the CGView website.
Use the '--custom' option to fine-tune the appearance of the map. This option works with the cgview_comparison_tool.pl script and with the wrapper scripts build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh. This option is used to supply key-value pairs, as in the following example:
```
build_blast_atlas.sh -p NC_012920 -m 2500m -b 2500 \
  --map_size x-large --custom 'width=20000 height=20000 backboneRadius=8000 \
  featureThickness=120 rulerFontSize=100 rulerPadding=200 \
  tickThickness=15 tickLength=40 draw_divider_rings=F \
  _cct_blast_thickness=2.0'
```
Typically the '--custom' option is used after a map has been created, to adjust the appearance of the map. To see what key-value pairs are available see the Customization keys table. To see what values were used for each key when a map was created, examine the '.log' files located in the maps/cgview_xml directory. Once you've determined the keys you would like to change, you can rerun the script using the '--custom' option with the keys and their new values. If many BLAST searches were performed to build the first map, you can reuse those BLAST results by starting the BLAST atlas at CGView XML creation. For example, suppose that after examining maps already created with the build_blast_atlas.sh script, you decide that you prefer the x-large map, but that you want to make the backbone circle larger. You examine the x-large.log file and find a section showing the key-value attributes for the map, and you see that the 'backboneRadius' key was assigned a value of '4000'. To redraw the x-large maps for the DNA vs DNA and the CDS vs CDS comparisons without repeating the BLAST searches, the following command could be used:
```
build_blast_atlas.sh -p NC_012920 --start_at_xml --custom 'backboneRadius=4500' --map_size x-large
```
The '--start_at_xml' option causes the script to rebuild the CGView XML from the existing BLAST results, and the '--map_size x-large' option restricts redrawing to the x-large maps.
Draw zoomed maps that show regions of interest in more detail. For example, suppose that the build_blast_atlas.sh script was run as follows:
```
build_blast_atlas.sh -i NC_012920.gbk
```
If, after examining some of the maps, you find that the region around position 400000 bp of the reference genome looks interesting, you can generate a zoomed version of all the maps showing this region in more detail using the following command:
```
create_zoomed_maps.sh -p NC_012920 -c 400000 -z 10 --format svgz
```
This will create new maps in the same directories as the existing maps, showing the region around position 400000 bp expanded by a factor of 10. Instead of the default PNG format, the maps will be generated in SVGZ format.

Labelling a subset of genes (labels_to_show.txt)

To label a subset of genes in the reference genome, place a file called labels_to_show.txt in the project directory. This file should be a tab-delimited or comma-delimited text file specifying which genes should be labeled. Each row must consist of a gene identifier followed by the text that is to be used for the label. When using a GenBank or EMBL file as the reference genome the gene identifier should match the value of the '/gene' qualifier (or the value of the '/locus_tag' qualifier if there isn't a '/gene' qualifier given for a particular gene). When describing genes using the 'features' directory and .gff files, the gene identifier should match the 'seqname' value. Note that providing a labels_to_show.txt file will cause the 'draw_feature_labels' setting in the 'project_settings.conf' file to be ignored.
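A labels_to_show.txt file of the kind described above can be produced and checked with a few lines of shell. This is a minimal sketch that uses the tab-delimited form only; `label_row` and `check_labels_file` are hypothetical helpers, and the gene identifiers in the usage example are placeholders rather than genes from any sample project.

```shell
#!/bin/sh
# Hypothetical helper: emit one row, "<gene identifier><TAB><label text>".
label_row() {
  printf '%s\t%s\n' "$1" "$2"
}

# Hypothetical helper: succeed only if every row has exactly two
# non-empty tab-separated fields.
check_labels_file() {
  awk -F '\t' '
    NF != 2 || $1 == "" || $2 == "" { bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Example (placeholder identifiers):
#   { label_row geneA "Gene A"; label_row LOCUS_0002 "Second gene"; } \
#     > my_project/labels_to_show.txt
#   check_labels_file my_project/labels_to_show.txt
```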

Adding additional features (feature GFF files)

To add features to the map, place one or more files with a '.gff' extension in the features directory, which is located in the CCT project directory. The files should be tab-delimited or comma-delimited and should have the following column titles, in the following order: 'seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame'. The first line in the file must be the column titles. For a given entry, 'seqname' should be the name of the gene, 'feature' should be the type of gene (CDS, rRNA, tRNA, other) or the single letter COG category (J for example). 'start' and 'end' should be integers between 1 and the length of the sequence, and the 'start' value should be less than or equal to the 'end' regardless of the 'strand' value. The 'strand' value should be '+' for the forward strand and '-' for the reverse strand. All other values can be given as '.' or left blank, since they are ignored. These column titles are based on the specification of the GFF file format. If 'start' and 'end' values are not supplied, but a 'seqname' is given, this script will attempt to get the 'start' and 'end' values from the sequence file.
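The column layout described above can be written out with `printf`, and the coordinate rules checked with `awk`. The sketch below uses the tab-delimited form; `gff_header`, `gff_feature_row` and `check_gff_coordinates` are hypothetical helpers, and the gene names in the usage example are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers for a tab-delimited feature file.

# The required first line: the eight column titles, in order.
gff_header() {
  printf 'seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\n'
}

# gff_feature_row <seqname> <feature> <start> <end> <strand>
# Ignored columns (source, score, frame) are written as '.'.
gff_feature_row() {
  printf '%s\t.\t%s\t%s\t%s\t.\t%s\t.\n' "$1" "$2" "$3" "$4" "$5"
}

# Succeed only if, on every data row, start <= end and strand is '+' or '-'.
check_gff_coordinates() {
  awk -F '\t' '
    NR > 1 && ($4 + 0 > $5 + 0 || ($7 != "+" && $7 != "-")) { bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Example (placeholder gene names):
#   { gff_header; gff_feature_row geneA CDS 100 900 +; } > my_project/features/extra.gff
#   check_gff_coordinates my_project/features/extra.gff
```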

Adding additional analysis results (analysis GFF files)

To add analysis results to the map, place one or more files with a '.gff' extension in the analysis directory, which is located in the CCT project directory. The files should be tab-delimited or comma-delimited and should have the following column titles, in the following order: 'seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame'. The first line in the file must be the column titles. For a given entry, only the 'start', 'end', 'strand', and 'score' values are required. 'start' and 'end' should be integers between 1 and the length of the sequence, and the 'start' value should be less than or equal to the 'end' regardless of the 'strand' value. The 'strand' value should be '+' for the forward strand and '-' for the reverse strand. The 'score' value should be a real number, positive or negative. The other values can be given as '.' or left blank. These column titles are based on the specification of the GFF file format. If 'start' and 'end' values are not supplied, but a 'seqname' is given, this script will attempt to get the 'start' and 'end' values from the sequence file.
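Analysis results often start out as a plain table of intervals and scores. The sketch below converts such a table into the column layout described above, filling the columns that are not required with '.'. The input layout (tab-separated start, end, strand, score) and the helper name `scores_to_analysis_gff` are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: scores_to_analysis_gff <input.tsv>
# Input rows (assumed layout): start<TAB>end<TAB>strand<TAB>score
# Output: the eight-column header followed by one row per input line, with
# seqname, source, feature and frame written as '.'.
scores_to_analysis_gff() {
  printf 'seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\n'
  awk -F '\t' -v OFS='\t' '{ print ".", ".", ".", $1, $2, $4, $3, "." }' "$1"
}

# Example:
#   scores_to_analysis_gff my_scores.tsv > my_project/analysis/my_scores.gff
```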

COG categories and colours

Category	Colour	Description
Information storage and processing [oranges/reds]
A	Red	RNA processing and modification
B	Tomato	Chromatin structure and dynamics
J	Light coral	Translation, ribosomal structure and biogenesis
K	Dark orange	Transcription
L	Deep pink	Replication, recombination and repair
Cellular processes and signaling [greens/yellows]
D	Khaki	Cell cycle control, cell division, chromosome partitioning
O	Dark khaki	Post-translational modification, protein turnover, and chaperones
M	Olive drab	Cell wall/membrane/envelope biogenesis
N	Forest green	Cell motility
P	Yellow green	Inorganic ion transport and metabolism
T	Lime green	Signal transduction mechanisms
U	Green yellow	Intracellular trafficking, secretion, and vesicular transport
V	Medium spring green	Defense mechanisms
W	Dark sea green	Extracellular structures (this doesn't appear in reference database)
Y	Medium sea green	Nuclear structure (this appears once in reference database)
Z	Yellow	Cytoskeleton
Metabolism [blues/purples]
C	Cyan	Energy production and conversion
G	Dark turquoise	Carbohydrate transport and metabolism
E	Steel blue	Amino acid transport and metabolism
F	Deep sky blue	Nucleotide transport and metabolism
H	Blue	Coenzyme transport and metabolism
I	Slate blue	Lipid transport and metabolism
Q	Navy	Secondary metabolites biosynthesis, transport, and catabolism
Poorly characterized [grays]
R	Gray	General function prediction only (examples include "Predicted thioesterase", "Predicted ATPase")
S	Dark gray	Function unknown (examples include "Uncharacterized conserved protein", "Predicted small secreted protein")
Unknown	White	Not assigned COG letter because protein is not similar to any COG
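When post-processing maps or building a custom legend, the table above can be expressed as a simple lookup. The sketch below transcribes the colour names exactly as listed; `cog_colour` is a hypothetical helper and says nothing about how CCT stores these colours internally.

```shell
#!/bin/sh
# Hypothetical helper: cog_colour <category>
# Prints the colour name listed in the table for a COG category letter
# (or for "Unknown"), and fails for anything else.
cog_colour() {
  case "$1" in
    # Information storage and processing
    A) echo "Red" ;;            B) echo "Tomato" ;;
    J) echo "Light coral" ;;    K) echo "Dark orange" ;;
    L) echo "Deep pink" ;;
    # Cellular processes and signaling
    D) echo "Khaki" ;;          O) echo "Dark khaki" ;;
    M) echo "Olive drab" ;;     N) echo "Forest green" ;;
    P) echo "Yellow green" ;;   T) echo "Lime green" ;;
    U) echo "Green yellow" ;;   V) echo "Medium spring green" ;;
    W) echo "Dark sea green" ;; Y) echo "Medium sea green" ;;
    Z) echo "Yellow" ;;
    # Metabolism
    C) echo "Cyan" ;;           G) echo "Dark turquoise" ;;
    E) echo "Steel blue" ;;     F) echo "Deep sky blue" ;;
    H) echo "Blue" ;;           I) echo "Slate blue" ;;
    Q) echo "Navy" ;;
    # Poorly characterized
    R) echo "Gray" ;;           S) echo "Dark gray" ;;
    Unknown) echo "White" ;;
    *) return 1 ;;
  esac
}

# Example:
#   cog_colour J
```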