Python Client Library

This is the GA4GH client API library, a convenient wrapper for the low-level HTTP GA4GH API that abstracts away network-centric details such as paging. The methods and types used by the client library are defined by the GA4GH schema.
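To make "abstracts away paging" concrete, the following is a minimal sketch of how an iterator can hide page-by-page requests from the caller. It does not show this library's internals: the names iterate_pages and fake_fetch_page, and the (items, next_token) page shape, are assumptions made for illustration only.

```python
def iterate_pages(fetch_page):
    """Yield every item from a paged source, requesting one page at a time.

    fetch_page is a hypothetical callable that takes a page token (None for
    the first page) and returns (items, next_token). A next_token of None
    marks the last page.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        for item in items:
            yield item
        if token is None:
            break


# A stand-in for a server that returns at most two items per page.
_DATA = ["a", "b", "c", "d", "e"]


def fake_fetch_page(token):
    start = 0 if token is None else int(token)
    end = start + 2
    next_token = str(end) if end < len(_DATA) else None
    return _DATA[start:end], next_token


# The caller sees one flat sequence and never handles a page token.
all_items = list(iterate_pages(fake_fetch_page))
```

The search methods documented below return iterators in the same spirit: the caller loops over results without issuing follow-up page requests by hand.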

Warning

This client API should be considered early alpha quality and may change in arbitrary ways. In particular, the current camelCase convention for identifiers may change to snake_case in the future.

Todo

A full description of this API and links to a tutorial on how to use it, as well as a quickstart showing the basic usage.

Types

Todo

A short overview of the types and links to the high-level docs.

References

class ga4gh.protocol.ReferenceSet(**kwargs)

A ReferenceSet is a set of References which typically comprise a reference assembly, such as GRCh38. A ReferenceSet defines a common coordinate space for comparing reference-aligned experimental data.

assemblyId

Public id of this reference set, such as GRCh37.

description

Optional free text description of this reference set.

id

The reference set ID. Unique in the repository.

isDerived

A reference set may be derived from a source if it contains additional sequences, or some of the sequences within it are derived (see the definition of isDerived in Reference).

md5checksum

Order-independent MD5 checksum which identifies this ReferenceSet. To compute this checksum, make a list of Reference.md5checksum for all References in this set. Then sort that list, and take the MD5 hash of all the strings concatenated together. Express the hash as a lower-case hexadecimal string.
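The recipe above can be sketched with the standard hashlib module. The function name reference_set_md5 is hypothetical, and the sketch assumes the input strings are already the lower-case hexadecimal Reference.md5checksum values.

```python
import hashlib


def reference_set_md5(reference_md5s):
    """Combine per-Reference MD5 strings into one order-independent checksum."""
    # Sorting first makes the result independent of the input order.
    joined = "".join(sorted(reference_md5s))
    # hexdigest() already returns a lower-case hexadecimal string.
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Two made-up checksum strings, used only to show order independence.
checksum_a = "0" * 32
checksum_b = "f" * 32
forward = reference_set_md5([checksum_a, checksum_b])
backward = reference_set_md5([checksum_b, checksum_a])
```

Here forward and backward are equal, which is the property the sort step is there to provide.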

name

The reference set name.

ncbiTaxonId

ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human) indicating the species which this assembly is intended to model. Note that contained References may specify a different ncbiTaxonId, as assemblies may contain reference sequences which do not belong to the modeled species, e.g. EBV in a human reference genome.

sourceAccessions

All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ), ideally with a version number, e.g. NC_000001.11.

sourceURI

Specifies a FASTA format file/string.

class ga4gh.protocol.Reference(**kwargs)

A Reference is a canonical assembled contig, intended to act as a reference coordinate space for other genomic annotations. A single Reference might represent the human chromosome 1, for instance. References are designed to be immutable.

id

The reference ID. Unique within the repository.

isDerived

A sequence X is said to be derived from source sequence Y if X and Y are of the same length and the per-base sequence divergence at A/C/G/T bases is sufficiently small. Two sequences derived from the same official sequence share the same coordinates and annotations, and can be replaced with the official sequence for certain use cases.

length

The length of this reference’s sequence.

md5checksum

The MD5 checksum uniquely representing this Reference as a lower-case hexadecimal string, calculated as the MD5 of the upper-case sequence excluding all whitespace characters (this is equivalent to SQ:M5 in SAM).
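As a minimal sketch of the normalisation described above (strip all whitespace, upper-case, then hash), assuming the sequence is held in memory as a plain ASCII string. The function name reference_md5 is hypothetical.

```python
import hashlib


def reference_md5(sequence):
    """MD5 of the upper-cased sequence with all whitespace removed."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines alike.
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Line breaks and letter case do not affect the checksum.
wrapped = reference_md5("acgt\nACGT \n")
unwrapped = reference_md5("ACGTACGT")
```

For a real chromosome it would be preferable to feed the hash incrementally with update() rather than build one large string; this sketch favours brevity.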

name

The name of this reference (e.g. ‘22’).

ncbiTaxonId

ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human).

sourceAccessions

All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) which must include a version number, e.g. GCF_000001405.26.

sourceDivergence

The sourceDivergence is the fraction of non-indel bases that do not match the reference this record was derived from.
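One way to read this definition is sketched below. It assumes the derived and source sequences have the same length and are already aligned without gaps (so every position is a non-indel base) and that comparison is case-insensitive. The function name source_divergence is hypothetical, and a server may compute the value differently.

```python
def source_divergence(derived, source):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(derived) != len(source):
        raise ValueError("sequences must have the same length")
    if not source:
        raise ValueError("sequences must not be empty")
    # Count positions where the bases disagree, ignoring letter case.
    mismatches = sum(
        1 for a, b in zip(derived.upper(), source.upper()) if a != b
    )
    return mismatches / float(len(source))


# One mismatch in four bases.
divergence = source_divergence("ACGT", "ACGA")
```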

sourceURI

The URI from which the sequence was obtained. Specifies a FASTA format file/string containing a single (name, sequence) pair. In most cases, clients should call the getReferenceBases() method to obtain sequence bases for a Reference instead of attempting to retrieve this URI.

Datasets

class ga4gh.protocol.Dataset(**kwargs)

A Dataset is a collection of related data of multiple types. Data providers decide how to group data into datasets. See [Metadata API](../api/metadata.html) for a more detailed discussion.

description

Additional, human-readable information on the dataset.

id

The dataset’s id, locally unique to the server instance.

name

The name of the dataset.

Variants

class ga4gh.protocol.VariantSet(**kwargs)

A VariantSet is a collection of variants and variant calls intended to be analyzed together.

datasetId

The ID of the dataset this variant set belongs to.

id

The variant set ID.

metadata

Optional metadata associated with this variant set. This array can be used to store information about the variant set, such as information found in VCF header fields, that isn’t already available in first-class fields such as “name”.

name

The variant set name.

referenceSetId

The ID of the reference set that describes the sequences used by the variants in this set.

class ga4gh.protocol.CallSet(**kwargs)

A CallSet is a collection of calls that were generated by the same analysis of the same sample.

created

The date this call set was created in milliseconds from the epoch.

id

The call set ID.

info

A map of additional call set information.

name

The call set name.

sampleId

The sample this call set’s data was generated from. Note: the current API does not have a rigorous definition of sample. Therefore, this field actually contains an arbitrary string, typically corresponding to the sampleId field in the read groups used to generate this call set.

updated

The time at which this call set was last updated in milliseconds from the epoch.
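The created and updated fields are plain integers. A minimal sketch of turning such a value into a datetime follows, assuming "the epoch" means the Unix epoch (1970-01-01T00:00:00 UTC). The function name from_epoch_millis is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps count milliseconds from the Unix epoch in UTC.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis):
    """Convert milliseconds since the epoch into a timezone-aware datetime."""
    return UNIX_EPOCH + timedelta(milliseconds=millis)


# 1500 ms after the epoch is one and a half seconds past midnight.
example = from_epoch_millis(1500)
```

Adding a timedelta to a fixed epoch avoids the rounding that can come from dividing by 1000 and calling a float-based conversion.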

variantSetIds

The IDs of the variant sets this call set has calls in.

class ga4gh.protocol.Variant(**kwargs)

A Variant represents a change in DNA sequence relative to some reference. For example, a variant could represent a SNP or an insertion. Variants belong to a VariantSet. This is equivalent to a row in VCF.

alternateBases

The bases that appear instead of the reference bases. Multiple alternate alleles are possible.

calls

The variant calls for this particular variant. Each one represents the determination of genotype with respect to this variant. Calls in this array are implicitly associated with this Variant.

created

The date this variant was created in milliseconds from the epoch.

end

The end position (exclusive), resulting in [start, end) closed-open interval. This is typically calculated by start + referenceBases.length.
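The half-open convention can be illustrated with a short sketch. It relies on the fact that the VCF POS column is 1-based, whereas start here is 0-based. The function names are hypothetical.

```python
def variant_end(start, reference_bases):
    """Exclusive end of a variant: start plus the reference allele length."""
    return start + len(reference_bases)


def interval_from_vcf(pos, ref):
    """Convert a 1-based VCF POS and REF allele into a 0-based [start, end)."""
    start = pos - 1
    return start, variant_end(start, ref)


# A three-base deletion whose VCF row has POS=101 and REF=ACG.
start, end = interval_from_vcf(101, "ACG")
# range() uses the same half-open convention, so this lists the covered bases.
covered = list(range(start, end))
```

With a half-open interval, end - start always equals the number of reference bases, and adjacent variants share a boundary without overlapping.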

id

The variant ID.

info

A map of additional variant information.

names

Names for the variant, for example a RefSNP ID.

referenceBases

The reference bases for this variant. They start at the given start position.

referenceName

The reference on which this variant occurs (e.g. chr20 or X).

start

The start position at which this variant occurs (0-based). This corresponds to the first base of the string of reference bases. Genomic positions are non-negative integers less than reference length. Variants spanning the join of circular genomes are represented as two variants, one on each side of the join (position 0).

updated

The time at which this variant was last updated in milliseconds from the epoch.

variantSetId

The ID of the VariantSet this variant belongs to. This transitively defines the ReferenceSet against which the Variant is to be interpreted.

Variant Annotation

class ga4gh.protocol.VariantAnnotationSet(**kwargs)

A VariantAnnotationSet record groups VariantAnnotation records. It is derived from a VariantSet and holds information describing the software and reference data used in the annotation.

analysis

Analysis details. It is essential to supply versions for all software and reference data used.

id

The ID of the variant annotation set record.

name

The variant annotation set name.

variantSetId

The ID of the variant set to which this annotation set belongs.

class ga4gh.protocol.VariantAnnotation(**kwargs)

A VariantAnnotation record represents the result of comparing a variant to a set of reference data.

createDateTime

The ISO 8601 time at which this record was created.

id

The ID of this VariantAnnotation.

info

Additional annotation data in key-value pairs.

transcriptEffects

The transcript effect annotation for the alleles of this variant. Each one represents the effect of a single allele on a single transcript.

variantAnnotationSetId

The ID of the variant annotation set this record belongs to.

variantId

The variant ID.

class ga4gh.protocol.TranscriptEffect(**kwargs)

A transcript effect record is a set of information describing the effect of an allele on a transcript.

alternateBases

The alternate allele. A variant may have more than one alternate allele, each of which will have a distinct annotation.

analysisResults

Output from prediction packages such as SIFT.

cDNALocation

Change relative to cDNA.

effects

Effect of the variant on this feature.

featureId

The ID of the transcript feature the annotation is relative to.

hgvsAnnotation

Human Genome Variation Society (HGVS) variant descriptions.

id

The ID of the transcript effect record.

proteinLocation

Change relative to protein.

Reads

class ga4gh.protocol.ReadGroupSet(**kwargs)

A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet represents all the reads from one experimental sample.

datasetId

The ID of the dataset this read group set belongs to.

id

The read group set ID.

name

The read group set name.

readGroups

The read groups in this set.

stats

Statistical data on reads in this read group set.

class ga4gh.protocol.ReadGroup(**kwargs)

A ReadGroup is a set of reads derived from one physical sequencing process.

created

The time at which this read group was created in milliseconds from the epoch.

datasetId

The ID of the dataset this read group belongs to.

description

The read group description.

experiment

The experiment used to generate this read group.

id

The read group ID.

info

A map of additional read group information.

name

The read group name.

predictedInsertSize

The predicted insert size of this read group.

programs

The programs used to generate this read group.

referenceSetId

The ID of the reference set to which the reads in this read group are aligned. Required if there are any read alignments.

sampleId

The sample this read group’s data was generated from. Note: the current API does not have a rigorous definition of sample. Therefore, this field actually contains an arbitrary string, typically corresponding to the SM tag in a BAM file.

stats

Statistical data on reads in this read group.

updated

The time at which this read group was last updated in milliseconds from the epoch.

class ga4gh.protocol.ReadAlignment(**kwargs)

Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.

alignedQuality

The quality of the read sequence contained in this alignment record (equivalent to QUAL in SAM). alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.

alignedSequence

The bases of the read sequence contained in this alignment record (equivalent to SEQ in SAM). alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.

alignment

The alignment for this alignment record. This field will be null if the read is unmapped.

duplicateFragment

The fragment is a PCR or optical duplicate (SAM flag 0x400).

failedVendorQualityChecks

The read fails platform or vendor quality checks (SAM flag 0x200).

fragmentLength

The observed length of the fragment, equivalent to TLEN in SAM.

fragmentName

The fragment name. Equivalent to QNAME (query template name) in SAM.

id

The read alignment ID. This ID is unique within the read group this alignment belongs to. For performance reasons, this field may be omitted by a backend. If provided, its intended use is to make caching and UI display easier for genome browsers and other lightweight clients.

improperPlacement

The orientation and the distance between reads from the fragment are inconsistent with the sequencing protocol (inverse of SAM flag 0x2).

info

A map of additional read alignment information.

nextMatePosition

The mapping of the primary alignment of the (readNumber+1)%numberReads read in the fragment. It replaces mate position and mate strand in SAM.

numberReads

The number of reads in the fragment (extension to SAM flag 0x1).

readGroupId

The ID of the read group this read belongs to. (Every read must belong to exactly one read group.)

readNumber

The read ordinal in the fragment, 0-based and less than numberReads. This field replaces SAM flag 0x40 and 0x80 and is intended to more cleanly represent multiple reads per fragment.
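The relationship between readNumber, numberReads and the read that nextMatePosition refers to can be sketched as follows. The function name next_read_number is hypothetical.

```python
def next_read_number(read_number, number_reads):
    """Ordinal of the read that nextMatePosition refers to."""
    if not 0 <= read_number < number_reads:
        raise ValueError("read_number must be 0-based and below number_reads")
    # The modulo wraps the last read in the fragment back to the first.
    return (read_number + 1) % number_reads


# In a conventional pair, each read points at the other.
mate_of_first = next_read_number(0, 2)
mate_of_second = next_read_number(1, 2)
```

The wrap-around is what lets the same rule cover fragments with more than two reads, which SAM flags 0x40 and 0x80 cannot express directly.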

secondaryAlignment

Whether this alignment is secondary. Equivalent to SAM flag 0x100. A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome. By convention, each read has one and only one alignment where both secondaryAlignment and supplementaryAlignment are false.

supplementaryAlignment

Whether this alignment is supplementary. Equivalent to SAM flag 0x800. Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores. In each linear alignment in a chimeric alignment, the read will be hard clipped. The alignedSequence and alignedQuality fields in the alignment record will only represent the bases for its respective linear alignment.
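The SAM flag equivalences listed for the fields above can be collected into one sketch. The function decode_sam_flag is hypothetical and is not part of this library. It assumes that a paired read (0x1) belongs to a two-read fragment and that a read with neither or both of 0x40 and 0x80 set maps to readNumber 0.

```python
def decode_sam_flag(flag):
    """Map a SAM FLAG integer onto the ReadAlignment fields described above."""
    paired = bool(flag & 0x1)
    return {
        # Assumption: paired reads come from two-read fragments.
        "numberReads": 2 if paired else 1,
        # 0x40 marks the first read, 0x80 the last; other cases map to 0 here.
        "readNumber": 1 if (flag & 0x80) and not (flag & 0x40) else 0,
        # improperPlacement is the inverse of the "proper pair" bit.
        "improperPlacement": not (flag & 0x2),
        "secondaryAlignment": bool(flag & 0x100),
        "failedVendorQualityChecks": bool(flag & 0x200),
        "duplicateFragment": bool(flag & 0x400),
        "supplementaryAlignment": bool(flag & 0x800),
    }


# 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, properly placed, first in pair.
first_in_pair = decode_sam_flag(99)
# 1171 = 0x1 + 0x2 + 0x10 + 0x80 + 0x400: second in pair, marked duplicate.
duplicate_mate = decode_sam_flag(1171)
```

Taking the inverse of 0x2 literally means an unpaired read, for which the bit is unset, comes out with improperPlacement set to True; callers may want to treat that case separately.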

class ga4gh.protocol.Position(**kwargs)

A Position is an unoriented base in some Reference. A Position is represented by a Reference name, and a base number on that Reference (0-based).

position

The 0-based offset from the start of the forward strand for that Reference. Genomic positions are non-negative integers less than Reference length.

referenceName

The name of the Reference on which the Position is located.

strand

Strand the position is associated with.

Client API

Todo

Add overview documentation for the client API.

class ga4gh.client.HttpClient(urlPrefix, logLevel=30, authenticationKey=None)

The GA4GH HTTP client. This class provides methods corresponding to the GA4GH search and object GET methods.

Todo

Add a better description of the role of this class and include links to the high-level API documentation.

Parameters:
  • urlPrefix (str) – The base URL of the GA4GH server we wish to communicate with. This should include the ‘http’ or ‘https’ prefix.
  • logLevel (int) – The amount of debugging information to log using the logging module. This is logging.WARNING by default.
  • authenticationKey (str) – The authentication key provided by the server after logging in.
getDataset(datasetId)

Returns the Dataset with the specified ID from the server.

Parameters:datasetId (str) – The ID of the Dataset of interest.
Returns:The Dataset of interest.
Return type:ga4gh.protocol.Dataset
getReadGroup(readGroupId)

Returns the ReadGroup with the specified ID from the server.

Parameters:readGroupId (str) – The ID of the ReadGroup of interest.
Returns:The ReadGroup of interest.
Return type:ga4gh.protocol.ReadGroup
getReadGroupSet(readGroupSetId)

Returns the ReadGroupSet with the specified ID from the server.

Parameters:readGroupSetId (str) – The ID of the ReadGroupSet of interest.
Returns:The ReadGroupSet of interest.
Return type:ga4gh.protocol.ReadGroupSet
getReference(referenceId)

Returns the Reference with the specified ID from the server.

Parameters:referenceId (str) – The ID of the Reference of interest.
Returns:The Reference of interest.
Return type:ga4gh.protocol.Reference
getReferenceSet(referenceSetId)

Returns the ReferenceSet with the specified ID from the server.

Parameters:referenceSetId (str) – The ID of the ReferenceSet of interest.
Returns:The ReferenceSet of interest.
Return type:ga4gh.protocol.ReferenceSet
getVariant(variantId)

Returns the Variant with the specified ID from the server.

Parameters:variantId (str) – The ID of the Variant of interest.
Returns:The Variant of interest.
Return type:ga4gh.protocol.Variant
getVariantSet(variantSetId)

Returns the VariantSet with the specified ID from the server.

Parameters:variantSetId (str) – The ID of the VariantSet of interest.
Returns:The VariantSet of interest.
Return type:ga4gh.protocol.VariantSet
searchDatasets()

Returns an iterator over the Datasets on the server.

Returns:An iterator over the ga4gh.protocol.Dataset objects on the server.
searchReadGroupSets(datasetId, name=None)

Returns an iterator over the ReadGroupSets fulfilling the specified conditions from the specified Dataset.

Parameters:
  • datasetId (str) – The ID of the ga4gh.protocol.Dataset of interest.
  • name (str) – Only ReadGroupSets matching the specified name will be returned.
Returns:An iterator over the ga4gh.protocol.ReadGroupSet objects defined by the query parameters.
Return type:iter
searchReads(readGroupIds, referenceId=None, start=None, end=None)

Returns an iterator over the Reads fulfilling the specified conditions from the specified ReadGroupIds.

Parameters:
  • readGroupIds (str) – The IDs of the ga4gh.protocol.ReadGroup objects of interest.
  • referenceId (str) – The ID of the ga4gh.protocol.Reference we wish to return reads mapped to.
  • start (int) – The start position (0-based) of this query. If a reference is specified, this defaults to 0. Genomic positions are non-negative integers less than reference length. Requests spanning the join of circular genomes are represented as two requests, one on each side of the join (position 0).
  • end (int) – The end position (0-based, exclusive) of this query. If a reference is specified, this defaults to the reference’s length.
Returns:

An iterator over the ga4gh.protocol.ReadAlignment objects defined by the query parameters.

Return type:

iter
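The rule about circular genomes means a window that wraps past the end of the reference has to be issued as two queries. A minimal sketch of that split follows. The function name split_circular_range and its span parameter are hypothetical, and the reference length is assumed to be known to the caller (for example from Reference.length).

```python
def split_circular_range(start, span, reference_length):
    """Split a window of `span` bases beginning at `start` into linear
    half-open (start, end) intervals on a circular reference."""
    if not 0 <= start < reference_length:
        raise ValueError("start must lie within the reference")
    if not 0 < span <= reference_length:
        raise ValueError("span must be between 1 and the reference length")
    end = start + span
    if end <= reference_length:
        # The window does not cross the join, so one request is enough.
        return [(start, end)]
    # One request up to the end of the reference, one from position 0.
    return [(start, reference_length), (0, end - reference_length)]


# A 20-base window starting 10 bases before the join of a 100-base genome.
wrapped = split_circular_range(90, 20, 100)
linear = split_circular_range(10, 20, 100)
```

Each resulting pair can be passed as the start and end arguments of a separate search call, with the two result iterators consumed one after the other.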

searchReferenceSets(accession=None, md5checksum=None, assemblyId=None)

Returns an iterator over the ReferenceSets fulfilling the specified conditions.

Parameters:
  • accession (str) – If not None, return the reference sets for which the accession matches this string (case-sensitive, exact match).
  • md5checksum (str) – If not None, return the reference sets for which the md5checksum matches this string (case-sensitive, exact match). See ga4gh.protocol.ReferenceSet.md5checksum for details.
  • assemblyId (str) – If not None, return the reference sets for which the assemblyId matches this string (case-sensitive, exact match).
Returns:

An iterator over the ga4gh.protocol.ReferenceSet objects defined by the query parameters.

searchReferences(referenceSetId, accession=None, md5checksum=None)

Returns an iterator over the References fulfilling the specified conditions from the specified ReferenceSet.

Parameters:
  • referenceSetId (str) – The ReferenceSet to search.
  • accession (str) – If not None, return the references for which the accession matches this string (case-sensitive, exact match).
  • md5checksum (str) – If not None, return the references for which the md5checksum matches this string (case-sensitive, exact match).
Returns:

An iterator over the ga4gh.protocol.Reference objects defined by the query parameters.

searchVariantSets(datasetId)

Returns an iterator over the VariantSets fulfilling the specified conditions from the specified Dataset.

Parameters:datasetId (str) – The ID of the ga4gh.protocol.Dataset of interest.
Returns:An iterator over the ga4gh.protocol.VariantSet objects defined by the query parameters.
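The search methods compose naturally because each returns an iterator. The sketch below walks every Dataset and collects the names of its VariantSets, using only the searchDatasets() and searchVariantSets(datasetId) signatures documented here. The helper variant_set_names_by_dataset and the Fake* classes are hypothetical: the stubs exist only so the example runs without a server, and in real use client would be a ga4gh.client.HttpClient.

```python
from collections import namedtuple


def variant_set_names_by_dataset(client):
    """Map each dataset ID to the names of the variant sets it contains."""
    names = {}
    for dataset in client.searchDatasets():
        names[dataset.id] = [
            variant_set.name
            for variant_set in client.searchVariantSets(dataset.id)
        ]
    return names


# Stand-ins for server objects; only the attributes used above are modelled.
FakeDataset = namedtuple("FakeDataset", ["id", "name"])
FakeVariantSet = namedtuple("FakeVariantSet", ["id", "name", "datasetId"])


class FakeClient(object):
    """Hypothetical stub exposing the two documented search methods."""

    def searchDatasets(self):
        return iter([FakeDataset("ds1", "demo")])

    def searchVariantSets(self, datasetId):
        return iter([FakeVariantSet("vs1", "calls", datasetId)])


result = variant_set_names_by_dataset(FakeClient())
```

Because the helper depends only on the method names, the same function should work unchanged against a real client instance, subject to the alpha-quality warning at the top of this page.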
searchVariants(variantSetId, start=None, end=None, referenceName=None, callSetIds=None)

Returns an iterator over the Variants fulfilling the specified conditions from the specified VariantSet.

Parameters:
  • variantSetId (str) – The ID of the ga4gh.protocol.VariantSet of interest.
  • start (int) – Required. The beginning of the window (0-based, inclusive) for which overlapping variants should be returned. Genomic positions are non-negative integers less than reference length. Requests spanning the join of circular genomes are represented as two requests, one on each side of the join (position 0).
  • end (int) – Required. The end of the window (0-based, exclusive) for which overlapping variants should be returned.
  • referenceName (str) – The name of the ga4gh.protocol.Reference we wish to return variants from.
  • callSetIds (list) – Only return variant calls which belong to call sets with these IDs. If an empty list, returns variants without any call objects. If None, returns all variant calls.
Returns:

An iterator over the ga4gh.protocol.Variant objects defined by the query parameters.

Return type:

iter
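The start and end parameters describe a half-open window, and variants that overlap it are returned. One conventional reading of "overlap" for two half-open intervals is sketched below. It is an assumption about the semantics rather than a statement of what any particular server implements, and the function name overlaps is hypothetical.

```python
def overlaps(variant_start, variant_end, window_start, window_end):
    """True if [variant_start, variant_end) shares at least one base
    with [window_start, window_end)."""
    return variant_start < window_end and variant_end > window_start


# A SNP at position 9 lies inside the window [0, 10).
inside = overlaps(9, 10, 0, 10)
# A SNP at position 10 does not, because the window end is exclusive.
outside = overlaps(10, 11, 0, 10)
```

Under this reading, a variant with equal start and end (zero reference bases) overlaps a window only when it lies strictly inside it, so such variants may need separate handling.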