Main Program/Method name: Earl Grey
Main Program/Method version (if applicable): v3.2.1
Tutorial Author(s): Tobias Baril
Last Update: 13/11/2023
This tutorial explains how to set up and run Earl Grey. Earl Grey is a fully automated TE curation and annotation pipeline designed to improve on initial RepeatModeler2 families through a BLAST, Extend, Align, Trim process. Following library generation and genome annotation, the pipeline automatically defragments TE annotations and creates user-friendly summaries and figures.
To follow this tutorial you need a terminal environment with an Anaconda (conda) distribution installed. Alternatively, Earl Grey can be used in any web browser via gitpod.io.
conda create -n earlgrey -c conda-forge -c bioconda earlgrey
If you want to use Earl Grey on gitpod, open the repository in gitpod:
https://github.com/TobyBaril/EarlGrey
Then work in /workspace/ and use the earlGrey command as described below.

Once installation has completed successfully, you can run Earl Grey on your favourite genome assembly in FASTA format. For the default options (de novo annotation, library curation, and subsequent annotation):
conda activate earlgrey
earlGrey -g [genome.fasta] -s [species_name_to_save_files] -t [integer_threads] -o [output_directory]
# Example
earlGrey -g /home/Toby/genomes/homoSapiens.fasta -s homoSapiensTest -t 16 -o /home/Toby/results/
-g : path to input genome in FASTA format
-s : identifier to save results
-o : path to output directory
-t : number of threads to use for computation
-r : RepeatMasker species term, if you would like to mask the input genome with known repeats prior to de novo curation (e.g. lepidoptera)
-l : path to an existing TE library in FASTA format to mask the input genome with prior to de novo curation (cannot be used in conjunction with -r)
-i : integer max number of iterations to run through the BLAST, Extend, Align, Trim process (default 10)
-f : number of flanking base pairs to add to consensus sequences in each round of the BLAST, Extend, Align, Trim process (default 1000)
-c : yes|no for clustering of TE consensus sequences to the 80-80-80 rule (default no)
-h : show help
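As a rough illustration of how the options above combine, the sketch below defines a hypothetical shell helper, build_earlgrey_cmd (not part of Earl Grey). It checks that the thread count is made of digits only, refuses to combine -r with -l, and prints the resulting command instead of running it. The positional argument order of the helper is an assumption made for this example.

```shell
# Hypothetical helper, not shipped with Earl Grey: validate inputs and
# print the earlGrey command line for inspection (nothing is executed).
# Usage: build_earlgrey_cmd GENOME SPECIES THREADS OUTDIR [RM_SPECIES] [LIBRARY]
build_earlgrey_cmd() {
    genome=$1; species=$2; threads=$3; outdir=$4
    rm_species=${5:-}   # optional value for -r
    library=${6:-}      # optional value for -l

    # -t must consist of digits only
    case $threads in
        ''|*[!0-9]*)
            echo "error: threads must be an integer" >&2
            return 1 ;;
    esac

    # -r and -l cannot be used together
    if [ -n "$rm_species" ] && [ -n "$library" ]; then
        echo "error: -r and -l cannot be used together" >&2
        return 1
    fi

    cmd="earlGrey -g $genome -s $species -t $threads -o $outdir"
    if [ -n "$rm_species" ]; then cmd="$cmd -r $rm_species"; fi
    if [ -n "$library" ]; then cmd="$cmd -l $library"; fi
    echo "$cmd"
}

# Reproduces the example command shown earlier in this tutorial
build_earlgrey_cmd /home/Toby/genomes/homoSapiens.fasta homoSapiensTest 16 /home/Toby/results/
```

Printing the command first makes it easy to review before launching a long run. Paths containing spaces would need additional quoting, which this sketch does not handle.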
Results are written to [species]EarlGrey/[species]_summaryFiles/.
tclassif                                    cov       count  proportion            gen       Number_of_Distinct_Classifications
DNA                                         71080     47     0.000953350141874542  74558126  1
LINE                                        9719430   7288   0.130360438512095     74558126  46
LTR                                         24131541  24825  0.323660777096248     74558126  146
Other (Simple Repeat, Microsatellite, RNA)  159031    418    0.0021329801127244    74558126  87
Penelope                                    74993     93     0.00100583268415303   74558126  3
Unclassified                                16867511  46883  0.226233033271249     74558126  350
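To show how the columns in this summary relate to each other, here is a minimal sketch that sums the cov column and divides by the genome size in gen to get the overall percentage of the genome covered by the listed rows. It assumes the file is tab-separated with a single header line (needed because one classification name contains spaces); the function name total_repeat_percent and the temporary file are made up for the example.

```shell
# Sum cov (column 2) over all data rows and express it as a percentage of
# the genome size (column 5). Assumes a tab-separated file with one header.
total_repeat_percent() {
    awk -F'\t' 'NR > 1 { cov += $2; gen = $5 }
                END { if (gen > 0) printf "%.2f\n", 100 * cov / gen }' "$1"
}

# Sample rows copied from the table above, written out with tabs
tmp=$(mktemp)
{
    printf 'tclassif\tcov\tcount\tproportion\tgen\tNumber_of_Distinct_Classifications\n'
    printf 'DNA\t71080\t47\t0.000953350141874542\t74558126\t1\n'
    printf 'LINE\t9719430\t7288\t0.130360438512095\t74558126\t46\n'
    printf 'LTR\t24131541\t24825\t0.323660777096248\t74558126\t146\n'
    printf 'Other (Simple Repeat, Microsatellite, RNA)\t159031\t418\t0.0021329801127244\t74558126\t87\n'
    printf 'Penelope\t74993\t93\t0.00100583268415303\t74558126\t3\n'
    printf 'Unclassified\t16867511\t46883\t0.226233033271249\t74558126\t350\n'
} > "$tmp"

total_repeat_percent "$tmp"   # prints 68.43
rm -f "$tmp"
```

The result is the same as adding up the proportion column and multiplying by 100, since each proportion is cov divided by gen.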
name coverage copy_number
RND-1_FAMILY-0#LTR/Gypsy 2466202 1254
RND-1_FAMILY-166#LTR/Gypsy 1629372 566
RND-1_FAMILY-133#LTR/Gypsy 1422586 1072
RND-1_FAMILY-408#LTR/Gypsy 1416551 1405
RND-1_FAMILY-76#LTR/Gypsy 1308630 387
RND-1_FAMILY-421#LTR/Gypsy 1298115 868
RND-1_FAMILY-77#LINE/Tad1 1036048 276
RND-1_FAMILY-33#LINE/Tad1 880704 480
RND-1_FAMILY-365#LTR/Gypsy 783576 238
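Family names in this table carry their classification after the # character, so per-classification totals can be derived from it. The sketch below is one way to do that with awk, assuming a whitespace-separated file with a single header line; the function name coverage_by_classification is hypothetical.

```shell
# Group families by the classification after '#', summing coverage and
# copy number, then sort by coverage (largest first).
coverage_by_classification() {
    awk 'NR > 1 {
             split($1, parts, "#")      # parts[1] = family, parts[2] = classification
             cov[parts[2]] += $2
             copies[parts[2]] += $3
         }
         END { for (c in cov) printf "%s\t%d\t%d\n", c, cov[c], copies[c] }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr
}

# Sample rows copied from the table above
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
name coverage copy_number
RND-1_FAMILY-0#LTR/Gypsy 2466202 1254
RND-1_FAMILY-166#LTR/Gypsy 1629372 566
RND-1_FAMILY-133#LTR/Gypsy 1422586 1072
RND-1_FAMILY-408#LTR/Gypsy 1416551 1405
RND-1_FAMILY-76#LTR/Gypsy 1308630 387
RND-1_FAMILY-421#LTR/Gypsy 1298115 868
RND-1_FAMILY-77#LINE/Tad1 1036048 276
RND-1_FAMILY-33#LINE/Tad1 880704 480
RND-1_FAMILY-365#LTR/Gypsy 783576 238
EOF

coverage_by_classification "$tmp"
# LTR/Gypsy   10325032   5790
# LINE/Tad1   1916752    756
rm -f "$tmp"
```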
# BED format
NC_045808.1 4964941 4965925 LINE/Penelope 5073 +
NC_045808.1 7291353 7291525 LINE/L2 1279 +
NC_045808.1 8922477 8923791 DNA/TcMar-Tc1 11957 +
# GFF3 format
NC_045808.1 RepeatMasker LINE/Penelope 4964942 4965925 5073 + . ID="RND-1_FAMILY-48";Tend="2677";Tstart="1700";shortTE="F";uid="cb016329-11ac-492a-b11c-db2dc50e9d89"
NC_045808.1 RepeatMasker LINE/L2 7291354 7291525 1279 + . ID="RND-5_FAMILY-151";Tend="1398";Tstart="1226";shortTE="F";uid="b1ffaec9-e362-4ffe-89c2-5830cea00ab7"
NC_045808.1 RepeatMasker DNA/TcMar-Tc1 8922478 8923791 11957 + . ID="RND-1_FAMILY-124";Tend="3292";Tstart="1979";shortTE="F";uid="1e507e68-39f3-45f0-b62b-65bd9b7919c8"
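The two formats describe the same annotations with different coordinate conventions: BED starts are 0-based, while GFF3 starts are 1-based, which is why each BED start above is one less than its GFF3 counterpart. The following sketch converts GFF3 rows to the six BED columns shown above. It assumes tab-separated GFF3 input, as the GFF3 format specifies; the function name gff3_to_bed is made up for this example.

```shell
# Convert GFF3 rows to BED: seqid, start-1, end, type, score, strand.
# Comment lines (starting with '#') and short rows are skipped.
gff3_to_bed() {
    awk -F'\t' -v OFS='\t' \
        '!/^#/ && NF >= 9 { print $1, $4 - 1, $5, $3, $6, $7 }' "$1"
}

# First GFF3 row from the example above, written out with tabs
tmp=$(mktemp)
printf 'NC_045808.1\tRepeatMasker\tLINE/Penelope\t4964942\t4965925\t5073\t+\t.\tID="RND-1_FAMILY-48";Tend="2677";Tstart="1700";shortTE="F";uid="cb016329-11ac-492a-b11c-db2dc50e9d89"\n' > "$tmp"

gff3_to_bed "$tmp"
# NC_045808.1   4964941   4965925   LINE/Penelope   5073   +
rm -f "$tmp"
```

Because BED intervals are half-open, the element length is simply end minus start in BED, and end minus start plus one in GFF3.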
>rnd-1_family-31#LINE
TGTAACATACACAACACTTCTTTCCTCCTACCACCATCTTGTGTGGTTAG
CGCTCATAGGAAGGAATAGCAATTTCTTTGTTTTTACACTTCGAGCAAGT
...
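For a quick look at a consensus library in this format, the sketch below prints each sequence name with its total length. It assumes a plain multi-line FASTA file; the function name fasta_lengths is hypothetical, and the sample uses only the two sequence lines shown above, so the reported length is for that excerpt rather than the full consensus.

```shell
# Print "name<TAB>length" for every record in a FASTA file.
fasta_lengths() {
    awk '/^>/ {
             if (name != "") print name "\t" len
             name = substr($1, 2)       # header without the leading ">"
             len = 0
             next
         }
         { len += length($0) }          # sequence lines may wrap
         END { if (name != "") print name "\t" len }' "$1"
}

# Excerpt of the record shown above (the real sequence is longer)
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>rnd-1_family-31#LINE
TGTAACATACACAACACTTCTTTCCTCCTACCACCATCTTGTGTGGTTAG
CGCTCATAGGAAGGAATAGCAATTTCTTTGTTTTTACACTTCGAGCAAGT
EOF

fasta_lengths "$tmp"
rm -f "$tmp"
```

Piping the output through `wc -l` gives the number of consensus sequences in the library.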