Python Client Library¶
This is the GA4GH client API library, a convenient wrapper for the low-level HTTP GA4GH API that abstracts away network-centric details such as paging. The methods and types used by the client library are defined by the GA4GH schema.
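As a rough illustration of what hiding paging behind an iterator can look like, the sketch below wraps a page-at-a-time fetch function in a generator. The `fetch_page` callable, the `items` key and the `nextPageToken` key are assumptions made for this example; they are not taken from the client's actual internals.

```python
def iterate_pages(fetch_page):
    """Yield items from every page returned by fetch_page(page_token).

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns a dict with 'items' and 'nextPageToken'.
    """
    page_token = None
    while True:
        response = fetch_page(page_token)
        for item in response["items"]:
            yield item
        page_token = response.get("nextPageToken")
        if page_token is None:
            break


# A stand-in for a server that returns two pages of results.
_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "page2"},
    "page2": {"items": ["c"], "nextPageToken": None},
}


def fake_fetch(page_token):
    return _PAGES[page_token]


results = list(iterate_pages(fake_fetch))
```

The caller only sees a flat stream of items and never handles a page token directly.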
Warning
This client API should be considered early alpha quality, and may change in arbitrary ways. In particular, the current camelCase convention for identifiers may change to snake_case in future.
Todo
A full description of this API and links to a tutorial on how to use it, as well as a quickstart showing the basic usage.
Types¶
Todo
A short overview of the types and links to the high-level docs.
References¶
-
class
ga4gh.protocol.ReferenceSet(**kwargs)¶ A ReferenceSet is a set of References which typically comprise a reference assembly, such as GRCh38. A ReferenceSet defines a common coordinate space for comparing reference-aligned experimental data.
-
assemblyId¶ Public id of this reference set, such as GRCh37.
-
description¶ Optional free text description of this reference set.
-
id¶ The reference set ID. Unique in the repository.
-
isDerived¶ A reference set may be derived from a source if it contains additional sequences, or some of the sequences within it are derived (see the definition of isDerived in Reference).
-
md5checksum¶ Order-independent MD5 checksum which identifies this ReferenceSet. To compute this checksum, make a list of Reference.md5checksum for all References in this set. Then sort that list, and take the MD5 hash of all the strings concatenated together. Express the hash as a lower-case hexadecimal string.
-
name¶ The reference set name.
-
ncbiTaxonId¶ ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human) indicating the species which this assembly is intended to model. Note that contained References may specify a different ncbiTaxonId, as assemblies may contain reference sequences which do not belong to the modeled species, e.g. EBV in a human reference genome.
-
sourceAccessions¶ All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) ideally with a version number, e.g. NC_000001.11.
-
sourceURI¶ Specifies a FASTA format file/string.
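The checksum rule given for ReferenceSet.md5checksum above can be sketched with the standard `hashlib` module. The function name `reference_set_md5` is made up for this example, and the sketch assumes the per-reference checksums are already lower-case hexadecimal strings.

```python
import hashlib


def reference_set_md5(reference_md5s):
    """Order-independent checksum over the member References' md5checksum values."""
    # Sorting first makes the result independent of the input order.
    concatenated = "".join(sorted(reference_md5s))
    # hexdigest() returns a lower-case hexadecimal string.
    return hashlib.md5(concatenated.encode("ascii")).hexdigest()
```

Because the list is sorted before hashing, two reference sets with the same members produce the same checksum regardless of how their References are ordered.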
-
-
class
ga4gh.protocol.Reference(**kwargs)¶ A Reference is a canonical assembled contig, intended to act as a reference coordinate space for other genomic annotations. A single Reference might represent the human chromosome 1, for instance. References are designed to be immutable.
-
id¶ The reference ID. Unique within the repository.
-
isDerived¶ A sequence X is said to be derived from source sequence Y, if X and Y are of the same length and the per-base sequence divergence at A/C/G/T bases is sufficiently small. Two sequences derived from the same official sequence share the same coordinates and annotations, and can be replaced with the official sequence for certain use cases.
-
length¶ The length of this reference’s sequence.
-
md5checksum¶ The MD5 checksum uniquely representing this Reference as a lower-case hexadecimal string, calculated as the MD5 of the upper-case sequence excluding all whitespace characters (this is equivalent to SQ:M5 in SAM).
-
name¶ The name of this reference (e.g. ‘22’).
-
ncbiTaxonId¶ ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human).
-
sourceAccessions¶ All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) which must include a version number, e.g. GCF_000001405.26.
-
sourceDivergence¶ The sourceDivergence is the fraction of non-indel bases that do not match the reference this record was derived from.
-
sourceURI¶ The URI from which the sequence was obtained. Specifies a FASTA format file/string with one name, sequence pair. In most cases, clients should call the getReferenceBases() method to obtain sequence bases for a Reference instead of attempting to retrieve this URI.
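The md5checksum and sourceDivergence definitions above can be sketched as follows. This is a simplified illustration: the function names are invented here, and `source_divergence` assumes a plain position-by-position comparison of two equal-length sequences with no special handling of non-A/C/G/T bases.

```python
import hashlib


def reference_md5(sequence):
    """MD5 of the upper-cased sequence with all whitespace removed."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def source_divergence(derived, source):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(derived) != len(source):
        raise ValueError("a derived sequence has the same length as its source")
    if not source:
        return 0.0
    mismatches = sum(1 for a, b in zip(derived, source) if a != b)
    return float(mismatches) / len(source)
```

Normalizing case and whitespace before hashing means that differently wrapped FASTA records of the same sequence yield the same checksum.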
-
Datasets¶
-
class
ga4gh.protocol.Dataset(**kwargs)¶ A Dataset is a collection of related data of multiple types. Data providers decide how to group data into datasets. See [Metadata API](../api/metadata.html) for a more detailed discussion.
-
description¶ Additional, human-readable information on the dataset.
-
id¶ The dataset’s id, locally unique to the server instance.
-
name¶ The name of the dataset.
-
Variants¶
-
class
ga4gh.protocol.VariantSet(**kwargs)¶ A VariantSet is a collection of variants and variant calls intended to be analyzed together.
-
datasetId¶ The ID of the dataset this variant set belongs to.
-
id¶ The variant set ID.
-
metadata¶ Optional metadata associated with this variant set. This array can be used to store information about the variant set, such as information found in VCF header fields, that isn’t already available in first class fields such as “name”.
-
name¶ The variant set name.
-
referenceSetId¶ The ID of the reference set that describes the sequences used by the variants in this set.
-
-
class
ga4gh.protocol.CallSet(**kwargs)¶ A CallSet is a collection of calls that were generated by the same analysis of the same sample.
-
created¶ The date this call set was created in milliseconds from the epoch.
-
id¶ The call set ID.
-
info¶ A map of additional call set information.
-
name¶ The call set name.
-
sampleId¶ The sample this call set’s data was generated from. Note: the current API does not have a rigorous definition of sample. Therefore, this field actually contains an arbitrary string, typically corresponding to the sampleId field in the read groups used to generate this call set.
-
updated¶ The time at which this call set was last updated in milliseconds from the epoch.
-
variantSetIds¶ The IDs of the variant sets this call set has calls in.
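The created and updated fields hold milliseconds from the epoch. A minimal conversion to a Python datetime might look like the sketch below, assuming "the epoch" means the Unix epoch (1970-01-01 UTC); the helper name is hypothetical.

```python
import datetime


def epoch_millis_to_datetime(millis):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(milliseconds=millis)
```

Using a timedelta rather than dividing by 1000 keeps the millisecond precision without floating-point rounding.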
-
-
class
ga4gh.protocol.Variant(**kwargs)¶ A Variant represents a change in DNA sequence relative to some reference. For example, a variant could represent a SNP or an insertion. Variants belong to a VariantSet. This is equivalent to a row in VCF.
-
alternateBases¶ The bases that appear instead of the reference bases. Multiple alternate alleles are possible.
-
calls¶ The variant calls for this particular variant. Each one represents the determination of genotype with respect to this variant. Calls in this array are implicitly associated with this Variant.
-
created¶ The date this variant was created in milliseconds from the epoch.
-
end¶ The end position (exclusive), resulting in [start, end) closed-open interval. This is typically calculated by start + referenceBases.length.
-
id¶ The variant ID.
-
info¶ A map of additional variant information.
-
names¶ Names for the variant, for example a RefSNP ID.
-
referenceBases¶ The reference bases for this variant. They start at the given start position.
-
referenceName¶ The reference on which this variant occurs (e.g. chr20 or X).
-
start¶ The start position at which this variant occurs (0-based). This corresponds to the first base of the string of reference bases. Genomic positions are non-negative integers less than reference length. Variants spanning the join of circular genomes are represented as two variants, one on each side of the join (position 0).
-
updated¶ The time at which this variant was last updated in milliseconds from the epoch.
-
variantSetId¶ The ID of the VariantSet this variant belongs to. This transitively defines the ReferenceSet against which the Variant is to be interpreted.
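The relationship between start, end and referenceBases described above can be written out directly. The helper names below are invented for this sketch; it assumes the 0-based, closed-open [start, end) convention stated for Variant.

```python
def variant_end(start, reference_bases):
    """End position (exclusive) of a variant: start + len(referenceBases)."""
    return start + len(reference_bases)


def overlaps(interval_start, interval_end, window_start, window_end):
    """True if two closed-open intervals share at least one base."""
    return interval_start < window_end and window_start < interval_end
```

For example, a variant at start 100 with referenceBases "ACG" covers positions 100, 101 and 102, so its end is 103 and a window starting at 103 does not overlap it.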
-
Variant Annotation¶
-
class
ga4gh.protocol.VariantAnnotationSet(**kwargs)¶ A VariantAnnotationSet record groups VariantAnnotation records. It is derived from a VariantSet and holds information describing the software and reference data used in the annotation.
-
analysis¶ Analysis details. It is essential to supply versions for all software and reference data used.
-
id¶ The ID of the variant annotation set record.
-
name¶ The variant annotation set name.
-
variantSetId¶ The ID of the variant set to which this annotation set belongs.
-
-
class
ga4gh.protocol.VariantAnnotation(**kwargs)¶ A VariantAnnotation record represents the result of comparing a variant to a set of reference data.
-
createDateTime¶ The ISO 8601 time at which this record was created.
-
id¶ The ID of this VariantAnnotation.
-
info¶ Additional annotation data in key-value pairs.
-
transcriptEffects¶ The transcript effect annotation for the alleles of this variant. Each one represents the effect of a single allele on a single transcript.
-
variantAnnotationSetId¶ The ID of the variant annotation set this record belongs to.
-
variantId¶ The variant ID.
-
-
class
ga4gh.protocol.TranscriptEffect(**kwargs)¶ A transcript effect record is a set of information describing the effect of an allele on a transcript.
-
alternateBases¶ Alternate allele. A variant may have more than one alternate allele, each of which will have a distinct annotation.
-
analysisResults¶ Output from prediction packages such as SIFT.
-
cDNALocation¶ Change relative to cDNA.
-
effects¶ Effect of variant on this feature.
-
featureId¶ The ID of the transcript feature the annotation is relative to.
-
hgvsAnnotation¶ Human Genome Variation Society variant descriptions.
-
id¶ The ID of the transcript effect record.
-
proteinLocation¶ Change relative to protein.
-
Reads¶
-
class
ga4gh.protocol.ReadGroupSet(**kwargs)¶ A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet represents all the reads from one experimental sample.
-
datasetId¶ The ID of the dataset this read group set belongs to.
-
id¶ The read group set ID.
-
name¶ The read group set name.
-
readGroups¶ The read groups in this set.
-
stats¶ Statistical data on reads in this read group set.
-
-
class
ga4gh.protocol.ReadGroup(**kwargs)¶ A ReadGroup is a set of reads derived from one physical sequencing process.
-
created¶ The time at which this read group was created in milliseconds from the epoch.
-
datasetId¶ The ID of the dataset this read group belongs to.
-
description¶ The read group description.
-
experiment¶ The experiment used to generate this read group.
-
id¶ The read group ID.
-
info¶ A map of additional read group information.
-
name¶ The read group name.
-
predictedInsertSize¶ The predicted insert size of this read group.
-
programs¶ The programs used to generate this read group.
-
referenceSetId¶ The ID of the reference set to which the reads in this read group are aligned. Required if there are any read alignments.
-
sampleId¶ The sample this read group’s data was generated from. Note: the current API does not have a rigorous definition of sample. Therefore, this field actually contains an arbitrary string, typically corresponding to the SM tag in a BAM file.
-
stats¶ Statistical data on reads in this read group.
-
updated¶ The time at which this read group was last updated in milliseconds from the epoch.
-
-
class
ga4gh.protocol.ReadAlignment(**kwargs)¶ Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.
-
alignedQuality¶ The quality of the read sequence contained in this alignment record (equivalent to QUAL in SAM). alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
-
alignedSequence¶ The bases of the read sequence contained in this alignment record (equivalent to SEQ in SAM). alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
-
alignment¶ The alignment for this alignment record. This field will be null if the read is unmapped.
-
duplicateFragment¶ The fragment is a PCR or optical duplicate (SAM flag 0x400).
-
failedVendorQualityChecks¶ The read fails platform or vendor quality checks (SAM flag 0x200).
-
fragmentLength¶ The observed length of the fragment, equivalent to TLEN in SAM.
-
fragmentName¶ The fragment name. Equivalent to QNAME (query template name) in SAM.
-
id¶ The read alignment ID. This ID is unique within the read group this alignment belongs to. For performance reasons, this field may be omitted by a backend. If provided, its intended use is to make caching and UI display easier for genome browsers and other lightweight clients.
-
improperPlacement¶ The orientation and the distance between reads from the fragment are inconsistent with the sequencing protocol (inverse of SAM flag 0x2).
-
info¶ A map of additional read alignment information.
-
nextMatePosition¶ The mapping of the primary alignment of the (readNumber+1)%numberReads read in the fragment. It replaces mate position and mate strand in SAM.
-
numberReads¶ The number of reads in the fragment (extension to SAM flag 0x1).
-
readGroupId¶ The ID of the read group this read belongs to. (Every read must belong to exactly one read group.)
-
readNumber¶ The read ordinal in the fragment, 0-based and less than numberReads. This field replaces SAM flag 0x40 and 0x80 and is intended to more cleanly represent multiple reads per fragment.
-
secondaryAlignment¶ Whether this alignment is secondary. Equivalent to SAM flag 0x100. A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome. By convention, each read has one and only one alignment where both secondaryAlignment and supplementaryAlignment are false.
-
supplementaryAlignment¶ Whether this alignment is supplementary. Equivalent to SAM flag 0x800. Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores. In each linear alignment in a chimeric alignment, the read will be hard clipped. The alignedSequence and alignedQuality fields in the alignment record will only represent the bases for its respective linear alignment.
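Several ReadAlignment fields above are described in terms of SAM flag bits. The sketch below rebuilds that subset of the SAM FLAG value from a plain dict keyed by the field names. It is only an illustration: the function name is invented, the dict stands in for a ReadAlignment object, and the strand and unmapped bits (0x4, 0x8, 0x10, 0x20) are left out because they depend on alignment details not covered here.

```python
SAM_PAIRED = 0x1
SAM_PROPER_PAIR = 0x2
SAM_FIRST = 0x40
SAM_LAST = 0x80
SAM_SECONDARY = 0x100
SAM_QC_FAIL = 0x200
SAM_DUPLICATE = 0x400
SAM_SUPPLEMENTARY = 0x800


def partial_sam_flag(read):
    """Combine the flag bits that the ReadAlignment fields map onto."""
    flag = 0
    if read["numberReads"] > 1:
        flag |= SAM_PAIRED
        # improperPlacement is the inverse of SAM flag 0x2.
        if not read.get("improperPlacement", False):
            flag |= SAM_PROPER_PAIR
        # readNumber is 0-based and replaces flags 0x40 and 0x80.
        if read["readNumber"] == 0:
            flag |= SAM_FIRST
        if read["readNumber"] == read["numberReads"] - 1:
            flag |= SAM_LAST
    if read.get("secondaryAlignment", False):
        flag |= SAM_SECONDARY
    if read.get("failedVendorQualityChecks", False):
        flag |= SAM_QC_FAIL
    if read.get("duplicateFragment", False):
        flag |= SAM_DUPLICATE
    if read.get("supplementaryAlignment", False):
        flag |= SAM_SUPPLEMENTARY
    return flag
```

A properly placed first read of a pair yields 0x1 | 0x2 | 0x40, which is 67.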
-
-
class
ga4gh.protocol.Position(**kwargs)¶ A Position is an unoriented base in some Reference. A Position is represented by a Reference name, and a base number on that Reference (0-based).
-
position¶ The 0-based offset from the start of the forward strand for that Reference. Genomic positions are non-negative integers less than Reference length.
-
referenceName¶ The name of the Reference on which the Position is located.
-
strand¶ Strand the position is associated with.
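Since positions are 0-based and must be less than the Reference length, code that displays or exchanges coordinates with 1-based tools needs a small conversion step. The helpers below are a hypothetical sketch of that bookkeeping, not part of the library.

```python
def validate_position(position, reference_length):
    """Return position if it is a valid 0-based offset, else raise ValueError."""
    if not 0 <= position < reference_length:
        raise ValueError(
            "position %d is outside [0, %d)" % (position, reference_length))
    return position


def to_one_based(position):
    """Convert a 0-based offset to the 1-based coordinate of the same base."""
    return position + 1
```

The last valid 0-based position on a Reference of length 100 is 99, which corresponds to 1-based coordinate 100.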
-
Client API¶
Todo
Add overview documentation for the client API.
-
class
ga4gh.client.HttpClient(urlPrefix, logLevel=30, authenticationKey=None)¶ The GA4GH HTTP client. This class provides methods corresponding to the GA4GH search and object GET methods.
Todo
Add a better description of the role of this class and include links to the high-level API documentation.
Parameters:
- urlPrefix (str) – The base URL of the GA4GH server we wish to communicate with. This should include the ‘http’ or ‘https’ prefix.
- logLevel (int) – The amount of debugging information to log using the logging module. This is logging.WARNING by default.
- authenticationKey (str) – The authentication key provided by the server after logging in.
-
getDataset(datasetId)¶ Returns the Dataset with the specified ID from the server.
Parameters: datasetId (str) – The ID of the Dataset of interest. Returns: The Dataset of interest. Return type: ga4gh.protocol.Dataset
-
getReadGroup(readGroupId)¶ Returns the ReadGroup with the specified ID from the server.
Parameters: readGroupId (str) – The ID of the ReadGroup of interest. Returns: The ReadGroup of interest. Return type: ga4gh.protocol.ReadGroup
-
getReadGroupSet(readGroupSetId)¶ Returns the ReadGroupSet with the specified ID from the server.
Parameters: readGroupSetId (str) – The ID of the ReadGroupSet of interest. Returns: The ReadGroupSet of interest. Return type: ga4gh.protocol.ReadGroupSet
-
getReference(referenceId)¶ Returns the Reference with the specified ID from the server.
Parameters: referenceId (str) – The ID of the Reference of interest. Returns: The Reference of interest. Return type: ga4gh.protocol.Reference
-
getReferenceSet(referenceSetId)¶ Returns the ReferenceSet with the specified ID from the server.
Parameters: referenceSetId (str) – The ID of the ReferenceSet of interest. Returns: The ReferenceSet of interest. Return type: ga4gh.protocol.ReferenceSet
-
getVariant(variantId)¶ Returns the Variant with the specified ID from the server.
Parameters: variantId (str) – The ID of the Variant of interest. Returns: The Variant of interest. Return type: ga4gh.protocol.Variant
-
getVariantSet(variantSetId)¶ Returns the VariantSet with the specified ID from the server.
Parameters: variantSetId (str) – The ID of the VariantSet of interest. Returns: The VariantSet of interest. Return type: ga4gh.protocol.VariantSet
-
searchDatasets()¶ Returns an iterator over the Datasets on the server.
Returns: An iterator over the ga4gh.protocol.Dataset objects on the server.
-
searchReadGroupSets(datasetId, name=None)¶ Returns an iterator over the ReadGroupSets fulfilling the specified conditions from the specified Dataset.
Parameters: name (str) – Only ReadGroupSets matching the specified name will be returned. Returns: An iterator over the ga4gh.protocol.ReadGroupSet objects defined by the query parameters. Return type: iter
-
searchReads(readGroupIds, referenceId=None, start=None, end=None)¶ Returns an iterator over the Reads fulfilling the specified conditions from the specified ReadGroupIds.
Parameters:
- readGroupIds (str) – The IDs of the ga4gh.protocol.ReadGroup of interest.
- referenceId (str) – The name of the ga4gh.protocol.Reference we wish to return reads mapped to.
- start (int) – The start position (0-based) of this query. If a reference is specified, this defaults to 0. Genomic positions are non-negative integers less than reference length. Requests spanning the join of circular genomes are represented as two requests, one on each side of the join (position 0).
- end (int) – The end position (0-based, exclusive) of this query. If a reference is specified, this defaults to the reference’s length.
Returns: An iterator over the ga4gh.protocol.ReadAlignment objects defined by the query parameters.
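The rule that a request spanning the join of a circular genome becomes two requests can be sketched as a small helper. The function name is invented, and the convention that `end` may exceed the reference length to express a wrap-around window is an assumption of this example, not something the API defines.

```python
def split_circular_request(start, end, reference_length):
    """Split a window that runs past the end of a circular reference.

    start is 0-based inclusive and end is 0-based exclusive. In this sketch
    an end greater than reference_length means the window wraps the join.
    """
    if not 0 <= start < reference_length:
        raise ValueError("start must lie within the reference")
    if end < start:
        raise ValueError("end must not be smaller than start")
    if end <= reference_length:
        return [(start, end)]
    wrapped_end = end - reference_length
    if wrapped_end > start:
        raise ValueError("window is longer than the reference")
    # One request on each side of the join (position 0).
    return [(start, reference_length), (0, wrapped_end)]
```

Each returned pair can then be passed as the start and end of a separate search call, and the two result streams concatenated.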
-
searchReferenceSets(accession=None, md5checksum=None, assemblyId=None)¶ Returns an iterator over the ReferenceSets fulfilling the specified conditions.
Parameters:
- accession (str) – If not None, return the reference sets for which the accession matches this string (case-sensitive, exact match).
- md5checksum (str) – If not None, return the reference sets for which the md5checksum matches this string (case-sensitive, exact match). See ga4gh.protocol.ReferenceSet::md5checksum for details.
- assemblyId (str) – If not None, return the reference sets for which the assemblyId matches this string (case-sensitive, exact match).
Returns: An iterator over the ga4gh.protocol.ReferenceSet objects defined by the query parameters.
-
searchReferences(referenceSetId, accession=None, md5checksum=None)¶ Returns an iterator over the References fulfilling the specified conditions from the specified ReferenceSet.
Parameters: - referenceSetId (str) – The ReferenceSet to search.
- accession (str) – If not None, return the references for which the accession matches this string (case-sensitive, exact match).
- md5checksum (str) – If not None, return the references for which the md5checksum matches this string (case-sensitive, exact match).
Returns: An iterator over the ga4gh.protocol.Reference objects defined by the query parameters.
-
searchVariantSets(datasetId)¶ Returns an iterator over the VariantSets fulfilling the specified conditions from the specified Dataset.
Parameters: datasetId (str) – The ID of the ga4gh.protocol.Dataset of interest. Returns: An iterator over the ga4gh.protocol.VariantSet objects defined by the query parameters.
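Because the search methods return iterators, they compose naturally in nested loops. The sketch below lists every variant set on a server by chaining searchDatasets() and searchVariantSets(). To keep it runnable without a server, it is exercised against `_StubClient`, a hypothetical in-memory stand-in with made-up IDs; in real use the `client` argument would be an HttpClient instance.

```python
import collections


def list_variant_sets(client):
    """Return (dataset id, variant set id) pairs for every dataset."""
    pairs = []
    for dataset in client.searchDatasets():
        for variant_set in client.searchVariantSets(dataset.id):
            pairs.append((dataset.id, variant_set.id))
    return pairs


# Hypothetical stand-in so the sketch runs offline; the IDs are invented.
_Record = collections.namedtuple("_Record", ["id", "name"])


class _StubClient(object):
    def searchDatasets(self):
        return iter([_Record("ds1", "first"), _Record("ds2", "second")])

    def searchVariantSets(self, datasetId):
        variant_sets = {
            "ds1": [_Record("vs1", "a"), _Record("vs2", "b")],
            "ds2": [],
        }
        return iter(variant_sets[datasetId])


pairs = list_variant_sets(_StubClient())
```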
-
searchVariants(variantSetId, start=None, end=None, referenceName=None, callSetIds=None)¶ Returns an iterator over the Variants fulfilling the specified conditions from the specified VariantSet.
Parameters:
- variantSetId (str) – The ID of the ga4gh.protocol.VariantSet of interest.
- start (int) – Required. The beginning of the window (0-based, inclusive) for which overlapping variants should be returned. Genomic positions are non-negative integers less than reference length. Requests spanning the join of circular genomes are represented as two requests, one on each side of the join (position 0).
- end (int) – Required. The end of the window (0-based, exclusive) for which overlapping variants should be returned.
- referenceName (str) – The name of the ga4gh.protocol.Reference we wish to return variants from.
- callSetIds (list) – Only return variant calls which belong to call sets with these IDs. If an empty list, returns variants without any call objects. If None, returns all variant calls.
Returns: An iterator over the ga4gh.protocol.Variant objects defined by the query parameters.