MADC File Format

Description of the DArT MADC file

Explanation of the MADC file format and its conversion to fixed allele IDs at Breeding Insight (BI).

flowchart TD
    A(["Genotype Sequencing"]) --> B["MADC File"]
    B --> D["Fixed Allele IDs Assigned"]
    D --> n3["Kinship Matrix"] & n4["Convert to VCF"]
    B@{ shape: div-proc}
    D@{ shape: procs}
    n3@{ shape: db}
    n4@{ shape: db}
     A:::Aqua
     B:::Sky
     D:::Sky
     n3:::Peach
     n4:::Peach
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Sky stroke-width:1px, stroke-dasharray:none, stroke:#374D7C, fill:#E2EBFF, color:#374D7C
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D

DArT MADC File

The DArT MADC (Missing Allele Discovery Counts) file is a comma-separated file provided by DArT from a sequencing project. It includes the following key columns:

  • AlleleID: Typically contains the chromosome, position, reference or alternative allele designation, and the allele ID according to the haplotype database (e.g., Chr01_000084128|Ref_0001).
  • CloneID: Represents the chromosome and position (e.g., Chr01_000084128).
  • AlleleSequence: The amplicon sequence (e.g., GAGTGTGAAGATTTGGACAAAAGAGGTTGGTTTTTACTGTTATGGCATTTATCTCCTTATAAAATTTTGTATTTTTTTTGT).
  • Sample Columns: Named with sample IDs and containing counts of each allele per sample.
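
As a rough illustration of the layout described above, the sketch below reads a MADC-like CSV and splits each AlleleID into its parts. It assumes the header is the first line of the file, that every column other than the three key columns is a sample column holding integer counts, and that AlleleIDs follow the pattern shown in the example (CloneID, then `|`, then `Ref` or `Alt`, then a numeric suffix). The function names `parse_allele_id` and `read_madc` are hypothetical and are not part of BIGr.

```python
import csv
import io
import re

# Assumed AlleleID layout, based on the example above:
#   <CloneID>|<Ref or Alt>_<numeric fixed ID>, e.g. Chr01_000084128|Ref_0001
ALLELE_ID = re.compile(r"^(?P<clone>[^|]+)\|(?P<kind>Ref|Alt)_(?P<num>\d+)$")

KEY_COLUMNS = ("AlleleID", "CloneID", "AlleleSequence")


def parse_allele_id(allele_id):
    """Return (clone_id, kind, fixed_id), or None if the ID has no fixed suffix."""
    match = ALLELE_ID.match(allele_id)
    if match is None:
        return None
    return match.group("clone"), match.group("kind"), match.group("num")


def read_madc(handle):
    """Read a MADC-like CSV into a list of dicts with per-sample counts."""
    rows = []
    for row in csv.DictReader(handle):
        record = {key: row.pop(key) for key in KEY_COLUMNS}
        # Assumption: everything left over is a sample column with integer counts.
        record["counts"] = {sample: int(value) for sample, value in row.items()}
        rows.append(record)
    return rows


# Tiny made-up file, for illustration only.
demo = io.StringIO(
    "AlleleID,CloneID,AlleleSequence,Sample1,Sample2\n"
    "Chr01_000084128|Ref_0001,Chr01_000084128,GAGT,12,0\n"
)
madc_rows = read_madc(demo)
```

A real MADC file may carry extra header rows or allele types beyond `Ref` and `Alt`; those would need handling that this sketch does not attempt.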

Figure: Example Missing Allele Discovery Counts (MADC) file with fixed allele IDs.

Important: BIGr requires fixed AlleleIDs tagged with a respective number (e.g., suffix: _0001). Fixed AlleleIDs are generated by matching the MADC contents to the haplotype database. If your MADC file does not contain fixed AlleleIDs, please contact Breeding Insight.

Target vs. Off-Target SNPs

If you provide a MADC file, you can choose between extracting:

  • Only target SNPs – These are the SNPs for which probes were specifically designed. If your panel includes 3,000 SNPs, your output VCF file should contain exactly 3,000 target SNPs.
  • Both target and off-target SNPs – In addition to target SNPs, BIGr will identify other SNPs within the amplicon region.

Target SNPs

If you decide to extract only the target SNPs, you can choose whether the reference (REF) and alternative (ALT) bases should also be extracted. To extract this information from the MADC file, BIGr compares the reference sequence with the alternative sequence, finds the polymorphic site, and records the base that changes. This requires:

  • Reference (Ref_0001) and alternative (Alt_0002) sequences for each tag. If one or both are missing, the tag is discarded.
  • A single polymorphism between them. If more than one is found, the tag is discarded.
  • A .botloci file that indicates which tag sequences should be converted to their reverse complement. For BI-supported marker panel species, this file is already embedded within the app; no upload is required.
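
The comparison described above can be sketched as follows. This is a minimal illustration, not BIGr's implementation: it assumes the two sequences have equal length (so only substitutions are considered), and it takes strand information as a plain boolean rather than reading a .botloci file, whose format is not described here. The names `reverse_complement` and `target_ref_alt` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def target_ref_alt(ref_seq, alt_seq, on_bottom_strand=False):
    """Return (offset, ref_base, alt_base), or None if the tag would be discarded.

    A tag is discarded when either sequence is missing or when the two
    sequences do not differ at exactly one position.
    """
    if not ref_seq or not alt_seq:
        return None  # one or both sequences missing
    ref_seq, alt_seq = ref_seq.upper(), alt_seq.upper()
    if len(ref_seq) != len(alt_seq):
        return None  # assumption: length differences are not handled in this sketch
    if on_bottom_strand:
        # Tags flagged as bottom-strand are reverse-complemented before comparison.
        ref_seq = reverse_complement(ref_seq)
        alt_seq = reverse_complement(alt_seq)
    diffs = [i for i, (r, a) in enumerate(zip(ref_seq, alt_seq)) if r != a]
    if len(diffs) != 1:
        return None  # zero or multiple polymorphisms
    i = diffs[0]
    return i, ref_seq[i], alt_seq[i]


print(target_ref_alt("ACGTA", "ACCTA"))                         # (2, 'G', 'C')
print(target_ref_alt("ACGTA", "ACCTA", on_bottom_strand=True))  # (2, 'C', 'G')
```

The offset is relative to the (possibly reverse-complemented) tag sequence; converting it to a genomic coordinate would need information about the tag's start position that is not covered here.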

If you choose not to extract REF and ALT information, all tags will be kept, but your VCF will have missing data (.) in the REF and ALT fields. The only BIGr process that requires this information is Dosage Calling with PolyRAD.
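
To make the missing-data case concrete, the sketch below builds the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) for one tag. It assumes the CloneID has the form chromosome, underscore, zero-padded position, as in the example `Chr01_000084128`, and that this position is the one to report. The helper name `vcf_site_fields` is hypothetical and only illustrates how `.` stands in for unknown REF and ALT bases.

```python
def vcf_site_fields(clone_id, ref=None, alt=None):
    """Build the fixed VCF columns for a tag; '.' marks missing values."""
    # Assumption: CloneID is "<chromosome>_<zero-padded position>".
    chrom, pos = clone_id.rsplit("_", 1)
    return [chrom, str(int(pos)), clone_id, ref or ".", alt or ".", ".", ".", "."]


# Without REF/ALT extraction, both fields are written as missing.
print("\t".join(vcf_site_fields("Chr01_000084128")))
# With REF/ALT extraction, the bases are filled in.
print("\t".join(vcf_site_fields("Chr01_000084128", ref="G", alt="C")))
```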

Target and off-target (all) SNPs

To identify off-target SNPs, BIGr aligns each amplicon against its reference (via Bioconductor’s Biostrings + pwalign) to uncover additional polymorphisms. This procedure requires:

  • Reference & alternative sequences (e.g. Ref_0001, Alt_0002) for every tag
    • If a FASTA haplotype database is supplied, any missing sequences in the MADC file will be retrieved automatically.
    • Without a FASTA database, tags lacking either sequence are discarded.
  • Exactly one polymorphic site
    • Tags with zero or multiple variants are excluded.
  • .botloci file
    • Specifies which tags must be reverse-complemented. For BI-supported species, this file is bundled within the app; no upload is needed.
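
The sequence-lookup rule in the first requirement above can be sketched as follows. This covers only the retrieval of missing sequences, not the alignment step. It assumes that the FASTA record names are the same fixed AlleleIDs used in the MADC file, which may not match the actual haplotype database layout. The names `read_fasta` and `resolve_pair` are hypothetical.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def resolve_pair(clone_id, madc_seqs, haplotype_db=None):
    """Return (ref_seq, alt_seq) for a tag, or None if the tag would be discarded.

    Sequences missing from the MADC file are looked up in the haplotype
    database when one is supplied; otherwise the tag is dropped.
    """
    pair = []
    for suffix in ("Ref_0001", "Alt_0002"):
        allele_id = f"{clone_id}|{suffix}"
        seq = madc_seqs.get(allele_id)
        if seq is None and haplotype_db is not None:
            # Assumption: database records are keyed by the fixed AlleleID.
            seq = haplotype_db.get(allele_id)
        if seq is None:
            return None
        pair.append(seq)
    return tuple(pair)


# Made-up sequences, for illustration only.
db = read_fasta(io.StringIO(">Chr01_000084128|Alt_0002\nACCT\nA\n"))
madc = {"Chr01_000084128|Ref_0001": "ACGTA"}
print(resolve_pair("Chr01_000084128", madc, db))  # ('ACGTA', 'ACCTA')
print(resolve_pair("Chr01_000084128", madc))      # None
```
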
This post is licensed under CC BY 4.0 by the author.