
Table 2 Predictor variables derived from the repeat and flanking sequences whose impact on variability was tested by regression.

From: Marked variation in predicted and observed variability of tandem repeat loci across the human genome

Predictor name

Description

Statistics About The Tandem Repeat

pop_size

Population size: number of unique sequences from which estimate of repeat variability was obtained

entropy

Based on percentage composition:

$-\sum_{\omega} \omega \log_2(\omega)$, where $\omega$ runs over the fractions of A, C, G and T in the repeat. This is 0 when the repeat array consists of only one nucleotide and 2 when all four nucleotides are equally frequent [16]

T

%T in the repeat, e.g. 50% in TGTGTGTG

G

%G in the repeat, e.g. 50% in TGTGTGTG

C

%C in the repeat, e.g. 50% in CACACACA

A

%A in the repeat, e.g. 50% in CACACACA

score

Overall score reported by the Tandem Repeats Finder (TRF) program [16]

%indels

% indels between actual repeat units and the inferred consensus [16]

%match

% matches between actual repeat units and the inferred consensus [16]

unit_length

Length of tandem repeat unit, e.g. 2 for TGTGTGTG

blocklength

Length of the tandem repeat array, e.g. 8 for TGTGTGTG

copy_number

Number of copies of repeat unit, e.g. 4 for TGTGTGTG

CG, CA, AC, AG, GA, CC, AT, TA, GC, AA

Observed/expected dinucleotide bias of 10 dimers in the tandem repeat array [24]

tm_repeat

Melting temperature of the tandem repeat sequence [26]

G+C_repeat

Fraction of the sequence represented by the bases G or C, e.g. 0.5 for TGTGTGTG

RNA_free_energy

Free energy of the tandem repeat sequence RNA secondary structure [37]

Statistics From 20 bp/500 bp Flanks of the Repeat

tm_flank20, tm_flank500

Melting temperature of the 20 bp/500 bp flanking sequences [26]

G+C_flank20, G+C_flank500

Fractional G+C content of the two 20 bp/500 bp flanking sequences

CpG_flank20, CpG_flank500

Number of CpG (CG) dinucleotides in the two 20 bp/500 bp flanking sequences

num_SNPs

Total number of SNPs in the two 500 bp flanks

SNP_allele_freq

Mean SNP minor allele frequency for SNPs in the two 500 bp flanks

Statistics on distance to nearest neighbouring elements

nearest_promoter

Distance in nucleotides from the nearest promoter

nearest_gene/CDS

Distance in nucleotides from the nearest gene/CDS

nearest_MCS

Distance in nucleotides from the nearest multi-species conserved sequence (MCS), as defined in [33]

nearest_CpG

Distance in nucleotides from the nearest CpG island (defined by the UCSC genome browser)

nearest_regulatory

Distance in nucleotides from the nearest regulatory region (UCSC genome browser)
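
Several of the sequence-derived predictors in the table can be computed directly from the repeat or flank sequence. The following is a minimal sketch in Python, not the authors' implementation. The function names are hypothetical, the entropy is assumed to use log base 2 (consistent with the stated maximum of 2), and the dinucleotide bias is assumed to be the simple ratio of observed dimer frequency to the product of the two base frequencies; the exact formula used in [24] may differ, for example by combining both strands.

```python
import math
from collections import Counter


def base_percentages(seq):
    """Percentage of A, C, G and T in seq (the A, C, G and T predictors)."""
    seq = seq.upper()
    counts = Counter(seq)
    return {base: 100.0 * counts[base] / len(seq) for base in "ACGT"}


def entropy(seq):
    """Shannon entropy in bits (assumed log base 2): 0 for a single
    nucleotide, 2 when all four nucleotides are equally frequent."""
    fractions = [p / 100.0 for p in base_percentages(seq).values()]
    return sum(-f * math.log2(f) for f in fractions if f > 0)


def cpg_count(seq):
    """Number of CpG (CG) dinucleotides in seq."""
    return seq.upper().count("CG")


def dinucleotide_obs_exp(seq, dimer):
    """Observed/expected ratio for one dimer (assumed single-strand formula):
    frequency of the dimer divided by the product of its base frequencies."""
    seq = seq.upper()
    n_dimers = len(seq) - 1
    observed = sum(seq[i:i + 2] == dimer for i in range(n_dimers)) / n_dimers
    pct = base_percentages(seq)
    expected = (pct[dimer[0]] / 100.0) * (pct[dimer[1]] / 100.0)
    return observed / expected if expected > 0 else float("nan")


def repeat_predictors(seq, unit_length):
    """Collect the simple repeat-array predictors for one locus.
    unit_length is supplied by the caller (e.g. taken from TRF output)."""
    pct = base_percentages(seq)
    return {
        "entropy": entropy(seq),
        "T": pct["T"],
        "G": pct["G"],
        "C": pct["C"],
        "A": pct["A"],
        "unit_length": unit_length,
        "blocklength": len(seq),
        "copy_number": len(seq) / unit_length,
        "G+C_repeat": (pct["G"] + pct["C"]) / 100.0,
    }


example = repeat_predictors("TGTGTGTG", unit_length=2)
```

For the table's running example TGTGTGTG, this gives 50% T, 50% G, a blocklength of 8, a copy number of 4 and a G+C fraction of 0.5, matching the values quoted in the descriptions. Predictors that depend on external tools or annotation (TRF score, melting temperature, RNA free energy, SNP statistics and distances to neighbouring elements) are not covered by this sketch.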