Difference between revisions of "Allele definition"

(Which definition should we stick to?)
 
(41 intermediate revisions by the same user not shown)
Line 1: Line 1:
==How to define PGx alleles==
+
==How to obtain PGx allele definitions from literature==
PGx alleles are defined as collections of one or more SNPs, INDELs or structural variants. When a patient is sequenced by next generation sequencing (NGS) we may typically observe more variants than those which are included in any individual PGx allele definitions.  
+
PGx alleles are collected and distributed through various channels
[[Fil:Variant tree outline.png|miniatyr|The 16 possible haplotypes for a four loci, decomposed variant calling]]
+
* PGx alleles on JSON-LD format from the [https://api.pharmgkb.org PharmGKB API]
 +
* PGx alleles on Excel-style formats, also accessible through the PharmGKB API (seems to be hidden from the Swagger documentation, but direct links when searching for haplotype definitions at the pharmgkb.org website)
 +
* [https://github.com/PharmGKB/PharmCAT/tree/master/src/main/resources/org/pharmgkb/pharmcat/definition/alleles PGx alleles for use in PharmCAT] are included in their source code. It is reasonable to suspect that these files are parsed from the Excel files.
 +
* [https://github.com/inumanag/aldy/tree/master/aldy/resources/genes PGx alleles for use in Aldy] are included in their source code. Due to the lack of GRCh37 variant definitions for CYP2D6, Aldy has likely lifted over GRCh38 variants to GRCh37. The Aldy developers have also added new star alleles with (non-functional) variants that are often observed together with the main star allele.
 +
* PGx alleles as VCF files from [https://www.pharmvar.org PharmVar]
 +
 
 +
==How to choose the correct PGx allele definitions==
 +
===Choosing an appropriate genomic reference build===
 +
PGx allele definitions are given in either GRCh37 or GRCh38 reference coordinates.
 +
* PharmCAT, CPIC and PharmGKB as a rule use build GRCh38.
 +
* Aldy uses GRCh37.
 +
* The PharmGKB API as a rule is using GRCh37 alleles, but not for all variants. Moreover, the build is given as "hg38", even when the actual coordinates are mostly GRCh37 (hg19). In the PharmGKB and CPIC Excel sheets, however, the move to GRCh38 is completed.
 +
 
 +
As is evident from liftOver, hg19 is not identical to GRCh37, nor hg38 to GRCh38. This could also be a source of error.
 +
 
 +
===Choosing the correct variants to include in the PGx alleles===
 +
The allele definitions from PharmGKB and PharmVar are not always one-to-one, and some background knowledge about why this is, is required.
 +
 
 +
Additional prefiltering of PharmGKB or PharmVar allele definitions by domain experts may be needed.
 +
* Prefiltering was performed by the [[PGx in Estonia|PGx pipeline of the University of Tartu]], and a list of the resulting variants were published in the supplementary material to [https://www.biorxiv.org/content/early/2018/07/04/356204 ''Reisberg et al.'']. The variants per star allele are, however, not written out explicitly.
 +
* Prefiltering was performed on allele definitions in Aldy and the [https://github.com/inumanag/aldy/tree/master/aldy/resources/genes result published in their code].
 +
 
 +
An example that illustrates many definition problems is CYP2C19*19:
 +
 
 +
{| class="wikitable"
 +
|-
 +
! CYP2C19*19 !! PharmGKB !! PharmVar !! PharmCAT !! Comment
 +
|-
 +
| NC_000010.10(GRCh37) || Source API: g.96522561T(rs17885098), g.96602623G(rs3758581), g.96522613A>G, g.96609568T>C(rs4917623)|| Source VCF: g.96521422A>G(rs7902257), g.96522613A>G || Source GitHub: g.96522613A>G(by liftOver) || Disagree on rs4917623(intron), rs7902257(2kb upstream variant). rs17885098 and rs3758581 are included here because PharmGKB assume different reference bases for these positions (seems like a problem caused by change of Major allele between reference builds GRCh37/GRCh38. Note that refSeq also agrees with GRCh38 and not GRCh37)
 +
|-
 +
| NC_000010.11(GRCh38) || Source Pharmgkb.org: g.94762856A>G || Source VCF: g.94762804C>T(rs17885098), g.94762856A>G, g.94842866A>G(rs3758581) || Source GitHub: g.94762856A>G || PharmGKB does not agree with itself when reporting GRCh38 variants on the homepage and GRCh37 variants in the API. The differences probably caused by non-standard use of Major/Minor Allele (rs17885098, rs3758581) with respect to dbSNP, and removing intron variants (rs4917623) may be sensible from an exon/protein-coding view.
 +
|}
 +
 
 +
The main problem with the changes in definitions is that the same patient may be given different PGx-advice depending on the build version of the pipeline (unless of course that the haplotype is always conserved)
 +
 
 +
Comparisons between PGx genotyping tools can give some insight into the accuracy of allele definitions. In Aldy they have [https://github.com/inumanag/aldy-paper-resources compared their method to targeted sequencing] with good results.
 +
 
 +
==How to define PGx alleles for next generation sequencing==
 +
As we saw in the previous section, PGx alleles are defined as collections of one or more SNPs, INDELs or structural variants. When a patient is sequenced by next generation sequencing ([[NGS]]) technology we typically observe more variants than those which are included in any individual PGx allele definitions.  
 +
[[File:Variant tree outline.png|thumb|The 16 possible haplotypes for a four loci, decomposed variant calling]]
 
This means that
 
This means that
 
*Patients may have a large, ambiguous number of matching PGx alleles
 
*Patients may have a large, ambiguous number of matching PGx alleles
 
*Patients may have additional variants that may modify the effect of a known PGx allele
 
*Patients may have additional variants that may modify the effect of a known PGx allele
We can illustrate all the possible PGx allele definitions for a four loci PGx haplotype as
+
We illustrate some of the problems that we encountered when trying to match patient haplotypes to the PGx allele definitions, by a four loci PGx gene
  
==The SNP array method==
+
===The SNP array method===
[[Fil:Variant tree allele snp definition.png|miniatyr|PGx alleles defined as collections of variants, with no requirement on loci that are not part of the definition, will assign the same PGx allele to several different haplotypes]]
+
[[File:Variant tree allele snp definition.png|thumb|PGx alleles defined as collections of variants, with no requirement on loci that are not part of the definition, will assign the same PGx allele to several different haplotypes]]
 
This definition only requires matches for variants explicitly included in PGx allele definitions.  
 
This definition only requires matches for variants explicitly included in PGx allele definitions.  
  
 
This means that
 
This means that
*Several PGx alleles may match to the patient
+
*Several PGx alleles may match the patient
 
*But the presence of additional variants will have no effect on reported PGx alleles
 
*But the presence of additional variants will have no effect on reported PGx alleles
  
==The PharmCAT method==
+
===The PharmCAT method===
This definition require matches also for variants not explicitly included in PGx allele definitions.  
+
This definition requires matches also for variants not explicitly included in PGx allele definitions.  
[[Fil:Variant tree allele pharmcat definition.png|miniatyr|PGx alleles defined as complete haplotypes classifies the patient uniquely]]
+
[[File:Variant tree allele pharmcat definition.png|thumb|PGx alleles defined as complete haplotypes classifies the patient uniquely]]
 
This means that
 
This means that
 
*Only one PGx allele can exist simultaneously for the same patient
 
*Only one PGx allele can exist simultaneously for the same patient
 
*But whenever we have additional variants, no PGx alleles will be reported
 
*But whenever we have additional variants, no PGx alleles will be reported
 +
(Note that in practice PharmCAT lets the user decide which allele definitions to use in their [https://github.com/PharmGKB/PharmCAT/wiki/NamedAlleleMatcher-101 NamedAlleleMatcher])
 +
 +
===The Aldy method===
 +
The Aldy method is similar to the PharmCAT method, but with some important notes
 +
- The method takes into account the copy number of each variant
 +
- The method introduces additional PGx variants in order to avoid no-calls. The additional variants have been interpreted and curated by the Aldy team.
 +
- The method uses BAM files instead of VCF files.
  
==Which definition should we stick to?==
+
===Which definition should we stick to?===
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
Line 29: Line 75:
 
| SNP array method || Compatible with previous SNP array methods. Assigns PGx alleles to the maximum number of patients || Multiple PGx alleles are possible
 
| SNP array method || Compatible with previous SNP array methods. Assigns PGx alleles to the maximum number of patients || Multiple PGx alleles are possible
 
|-
 
|-
| PharmCAT method || One PGx allele per patient || Less compatible with previous SNP array methods. Many patients are not assigned to a known PGx allele
+
| PharmCAT method || One PGx allele per patient || Less compatible with previous SNP array methods. Some patients are no longer assigned to a known PGx allele
 +
|-
 +
| Aldy method || One PGx allele per patient, detection of new variants || [https://github.com/inumanag/aldy-paper-resources Test data sets for Aldy] shows that the method performs well, also with respect to targeted methods
 
|}
 
|}
 +
 +
For now, due to the good performance and documentation of Aldy, this is our preferred method for genotyping.

Latest revision as of 09:37, 27 February 2019

How to obtain PGx allele definitions from literature

PGx alleles are collected and distributed through various channels:

  • PGx alleles in JSON-LD format from the PharmGKB API
  • PGx alleles in Excel-style formats, also accessible through the PharmGKB API (these seem to be hidden from the Swagger documentation, but direct links appear when searching for haplotype definitions on the pharmgkb.org website)
  • PGx alleles for use in PharmCAT are included in its source code. It is reasonable to suspect that these files are parsed from the Excel files.
  • PGx alleles for use in Aldy are included in its source code. Due to the lack of GRCh37 variant definitions for CYP2D6, Aldy has likely lifted over GRCh38 variants to GRCh37. The Aldy developers have also added new star alleles with (non-functional) variants that are often observed together with the main star allele.
  • PGx alleles as VCF files from PharmVar

How to choose the correct PGx allele definitions

Choosing an appropriate genomic reference build

PGx allele definitions are given in either GRCh37 or GRCh38 reference coordinates.

  • PharmCAT, CPIC and PharmGKB as a rule use build GRCh38.
  • Aldy uses GRCh37.
  • The PharmGKB API mostly uses GRCh37 coordinates, but not for all variants. Moreover, the build is reported as "hg38" even when the actual coordinates are mostly GRCh37 (hg19). In the PharmGKB and CPIC Excel sheets, however, the move to GRCh38 is complete.

As is evident from liftOver, hg19 is not identical to GRCh37, nor hg38 to GRCh38. This could also be a source of error.
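A mislabelled build, such as coordinates reported as "hg38" that are really GRCh37, can be caught by checking a few anchor variants whose positions are known in both builds. The following is a minimal sketch, assuming the chromosome 10 positions of rs17885098 and rs3758581 given in the CYP2C19*19 example of this article; the function names and the label lookup are illustrative and not part of any of the tools mentioned here.

```python
# Anchor positions on chromosome 10, taken from the CYP2C19*19 example in this
# article. The dictionary layout and function names are illustrative only.
ANCHORS = {
    "rs17885098": {"GRCh37": 96522561, "GRCh38": 94762804},
    "rs3758581": {"GRCh37": 96602623, "GRCh38": 94842866},
}

# UCSC-style labels mapped to the GRC build they are commonly used for.
# hg19/hg38 are not identical to GRCh37/GRCh38, so this is a label lookup only.
LABEL_TO_GRC = {
    "hg19": "GRCh37",
    "hg38": "GRCh38",
    "GRCh37": "GRCh37",
    "GRCh38": "GRCh38",
}


def infer_build(rsid, position):
    """Return the build whose anchor position equals `position`, else None."""
    for build, anchor in ANCHORS.get(rsid, {}).items():
        if anchor == position:
            return build
    return None


def label_matches_coordinates(label, rsid, position):
    """True or False when the declared label can be checked, None otherwise."""
    inferred = infer_build(rsid, position)
    declared = LABEL_TO_GRC.get(label)
    if inferred is None or declared is None:
        return None
    return inferred == declared


# A record labelled "hg38" that carries the GRCh37 position of rs17885098
print(label_matches_coordinates("hg38", "rs17885098", 96522561))
```

A check of this kind only detects a wrong label for the anchor variants themselves; it does not prove that every other coordinate in the same file uses the same build.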

Choosing the correct variants to include in the PGx alleles

The allele definitions from PharmGKB and PharmVar do not always correspond one-to-one, and some background knowledge is required to understand why.

Additional prefiltering of PharmGKB or PharmVar allele definitions by domain experts may be needed.

An example that illustrates many definition problems is CYP2C19*19:

{| class="wikitable"
|-
! CYP2C19*19 !! PharmGKB !! PharmVar !! PharmCAT !! Comment
|-
| NC_000010.10 (GRCh37) || Source API: g.96522561T (rs17885098), g.96602623G (rs3758581), g.96522613A>G, g.96609568T>C (rs4917623) || Source VCF: g.96521422A>G (rs7902257), g.96522613A>G || Source GitHub: g.96522613A>G (by liftOver) || The sources disagree on rs4917623 (intron) and rs7902257 (2 kb upstream variant). rs17885098 and rs3758581 are included here because PharmGKB assumes different reference bases for these positions (this seems to be caused by a change of major allele between reference builds GRCh37 and GRCh38; note that RefSeq also agrees with GRCh38 and not GRCh37).
|-
| NC_000010.11 (GRCh38) || Source Pharmgkb.org: g.94762856A>G || Source VCF: g.94762804C>T (rs17885098), g.94762856A>G, g.94842866A>G (rs3758581) || Source GitHub: g.94762856A>G || PharmGKB is not consistent with itself, reporting GRCh38 variants on the homepage and GRCh37 variants in the API. The differences are probably caused by non-standard use of major/minor allele (rs17885098, rs3758581) with respect to dbSNP. Removing intron variants (rs4917623) may be sensible from an exon/protein-coding view.
|}
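The disagreement between sources in the CYP2C19*19 example can be made explicit by treating each definition as a set of variants and comparing the sets. The sketch below uses the GRCh37 (NC_000010.10) entries listed above; the string encoding of the variants and the function names are assumptions made for illustration.

```python
# CYP2C19*19 on NC_000010.10 (GRCh37), copied from the comparison above.
# Variants are encoded as position plus alleles; entries that only state a
# base (no ">") are the positions where PharmGKB lists the reference base.
definitions = {
    "PharmGKB": {"96522561T", "96602623G", "96522613A>G", "96609568T>C"},
    "PharmVar": {"96521422A>G", "96522613A>G"},
    "PharmCAT": {"96522613A>G"},
}


def core_variants(defs):
    """Variants that every source includes in its definition."""
    return set.intersection(*defs.values())


def beyond_core(defs):
    """For each source, the variants it lists in addition to the shared core."""
    core = core_variants(defs)
    return {source: variants - core for source, variants in defs.items()}


print("shared by all sources:", sorted(core_variants(definitions)))
for source, extra in beyond_core(definitions).items():
    print(source, "adds:", sorted(extra))
```

Only g.96522613A>G is common to all three sources in this build, which is why a prefiltering step by domain experts may be needed before the definitions are used for matching.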

The main problem with the changes in definitions is that the same patient may be given different PGx advice depending on the build version of the pipeline (unless, of course, the haplotype is always conserved).

Comparisons between PGx genotyping tools can give some insight into the accuracy of allele definitions. The Aldy developers have compared their method to targeted sequencing, with good results.

How to define PGx alleles for next generation sequencing

As we saw in the previous section, PGx alleles are defined as collections of one or more SNPs, INDELs or structural variants. When a patient is sequenced by next generation sequencing (NGS) technology, we typically observe more variants than those included in any individual PGx allele definition.

Figure: The 16 possible haplotypes for a four-locus, decomposed variant calling

This means that

  • Patients may have a large, ambiguous number of matching PGx alleles
  • Patients may have additional variants that may modify the effect of a known PGx allele

We illustrate some of the problems that we encountered when trying to match patient haplotypes to the PGx allele definitions, using a PGx gene with four loci as an example.
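The 16 haplotypes of the four-locus example follow from each locus carrying either the reference or the alternate allele. A minimal sketch, assuming biallelic loci encoded as 0 (reference) and 1 (alternate):

```python
from itertools import product

N_LOCI = 4

# Each haplotype is a tuple with one entry per locus:
# 0 = reference allele, 1 = alternate allele.
haplotypes = list(product((0, 1), repeat=N_LOCI))

# Compact labels such as "0101" for printing
labels = ["".join(str(allele) for allele in hap) for hap in haplotypes]

print(len(haplotypes))
print(labels)
```

With biallelic loci the number of possible haplotypes doubles for every additional locus, so real PGx genes with many variant sites allow far more haplotypes than there are named alleles.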

The SNP array method

Figure: PGx alleles defined as collections of variants, with no requirement on loci that are not part of the definition, will assign the same PGx allele to several different haplotypes

This definition only requires matches for variants explicitly included in PGx allele definitions.

This means that

  • Several PGx alleles may match the patient
  • But the presence of additional variants will have no effect on reported PGx alleles
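The SNP array style of matching can be sketched as a subset test: an allele matches whenever all of its defining variants are present, regardless of what else is observed. The allele names and locus indices below are hypothetical and serve only to illustrate the four-locus example.

```python
# Hypothetical allele definitions for the four-locus example (loci 0-3).
# Each allele is defined by the set of loci that must carry the alternate allele.
ALLELES = {
    "*2": {0},
    "*3": {1},
    "*4": {0, 2},
}


def snp_array_match(alt_loci, alleles=ALLELES):
    """Return every allele whose defining loci are all alternate in the haplotype."""
    return sorted(name for name, required in alleles.items() if required <= alt_loci)


# Haplotype with alternate alleles at loci 0 and 2
print(snp_array_match({0, 2}))
# The same haplotype with an additional variant at locus 3
print(snp_array_match({0, 2, 3}))
```

Both calls return the same two alleles, which shows the two properties listed above: several alleles may match, and the extra variant at locus 3 does not change the result.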

The PharmCAT method

This definition also requires matches at loci that are not explicitly included in the PGx allele definitions.

Figure: PGx alleles defined as complete haplotypes classify the patient uniquely

This means that

  • Only one PGx allele can match a given patient haplotype
  • But whenever additional variants are present, no PGx allele is reported

(Note that in practice PharmCAT lets the user decide which allele definitions to use in its NamedAlleleMatcher.)
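The stricter, complete-haplotype definition can be sketched as an equality test over all loci. This is a simplified illustration of the definition described in this section and not PharmCAT's actual NamedAlleleMatcher implementation; the allele names and locus indices are hypothetical.

```python
# Hypothetical allele definitions for the four-locus example (loci 0-3).
# Each allele lists exactly the loci that carry the alternate allele;
# all other loci are required to be reference.
ALLELES = {
    "*1": frozenset(),
    "*2": frozenset({0}),
    "*3": frozenset({1}),
    "*4": frozenset({0, 2}),
}


def exact_match(alt_loci, alleles=ALLELES):
    """Return the single allele equal to the haplotype, or None for a no-call."""
    hits = [name for name, required in alleles.items() if required == alt_loci]
    return hits[0] if hits else None


# Haplotype that equals a defined allele
print(exact_match({0, 2}))
# The same haplotype with an additional variant at locus 3 gives a no-call
print(exact_match({0, 2, 3}))
```

Because the comparison covers every locus, at most one definition can match, and any variant outside the definitions turns the result into a no-call.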

The Aldy method

The Aldy method is similar to the PharmCAT method, with some important differences:

  • The method takes into account the copy number of each variant
  • The method introduces additional PGx variants in order to avoid no-calls. The additional variants have been interpreted and curated by the Aldy team.
  • The method uses BAM files instead of VCF files

Which definition should we stick to?

{| class="wikitable"
|-
! Method !! Advantages !! Disadvantages
|-
| SNP array method || Compatible with previous SNP array methods. Assigns PGx alleles to the maximum number of patients || Multiple PGx alleles are possible
|-
| PharmCAT method || One PGx allele per patient || Less compatible with previous SNP array methods. Some patients are no longer assigned to a known PGx allele
|-
| Aldy method || One PGx allele per patient, detection of new variants || Test data sets for Aldy show that the method performs well, also with respect to targeted methods
|}

For now, due to the good performance and documentation of Aldy, this is our preferred method for genotyping.