Alternative way to get genomic positions #144

grst · 2024-10-04T06:37:04Z

Description of feature

Currently, the only way to get genomic positions is by reading a GTF file. This is (a) slow, and (b) gtfparse repeatedly causes problems.

It could be more convenient to retrieve this information from online sources such as biomart or Bioconductor AnnotationHub.

Then gtfparse could become an optional dependency.
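
One possible way to make gtfparse optional is to import it only when the GTF code path is actually used. The following is a minimal sketch, not the project's implementation: `require_optional` and `read_positions_from_gtf` are hypothetical names introduced for illustration, and the only gtfparse call assumed is `gtfparse.read_gtf`.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, or fail with a clear hint.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with `pip install {module_name}`."
        ) from err


def read_positions_from_gtf(gtf_path: str):
    """Hypothetical GTF code path: gtfparse is imported only when this runs."""
    gtfparse = require_optional(
        "gtfparse", "Reading genomic positions from a GTF file"
    )
    return gtfparse.read_gtf(gtf_path)
```

With this pattern, users who only query an online source never need gtfparse installed, while users who pass a GTF file get an actionable error message if it is missing.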

zktuong · 2024-10-04T09:12:05Z

This works very well for me too:
https://scanpy.readthedocs.io/en/stable/generated/scanpy.queries.biomart_annotations.html

import pandas as pd
import scanpy as sc
from anndata import AnnData


def query_biomart() -> pd.DataFrame:
    """
    Extract gene annotations from Biomart.

    Returns
    -------
    pd.DataFrame
        DataFrame with gene annotations from Biomart, with the columns
        `gene_ids`, `gene_symbol`, `start`, `end` and `chromosome`.
    """
    annot = sc.queries.biomart_annotations(
        "hsapiens",
        [
            "ensembl_gene_id",
            "hgnc_symbol",
            "start_position",
            "end_position",
            "chromosome_name",
        ],
        use_cache=True,
    ).rename(
        columns={
            "ensembl_gene_id": "gene_ids",
            "hgnc_symbol": "gene_symbol",
            "start_position": "start",
            "end_position": "end",
            "chromosome_name": "chromosome",
        }
    )
    return annot
    
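def annotate_var(
    adata: AnnData, annotation: pd.DataFrame, index_key: str = "gene_ids"
) -> None:
    """
    Annotate the features in an AnnData object, in place.

    Parameters
    ----------
    adata : AnnData
        Input AnnData object. `adata.var` must contain the column `index_key`.
    annotation : pd.DataFrame
        Gene annotation DataFrame.
    index_key : str, optional
        Column present in both `adata.var` and `annotation` used to match features.
    """
    for col in ["start", "end", "chromosome", index_key]:
        assert (
            col in annotation.columns
        ), f"Annotation DataFrame must contain the column named `{col}`."

    for col in annotation.columns:
        if col == index_key:
            continue  # the key column already exists in adata.var
        # Key the lookup on the index_key column, not on the DataFrame's row
        # index, so it also works with the default RangeIndex from Biomart.
        var_dict = dict(zip(annotation[index_key], annotation[col]))
        # Features without a match in the annotation get None.
        adata.var[col] = [var_dict.get(x) for x in adata.var[index_key]]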
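
As an alternative to the per-column loop above, the same matching can be expressed as a single left join in pandas. This is only a sketch on toy data: `join_positions` is a hypothetical helper, the `var` table stands in for `adata.var`, and all gene IDs and coordinates are made up.

```python
import pandas as pd

# Toy stand-in for adata.var; IDs are made up.
var = pd.DataFrame(
    {"gene_ids": ["G1", "G2", "G3"]},
    index=["geneA", "geneB", "geneC"],
)

# Toy stand-in for the Biomart table: G2 is duplicated, G3 is missing.
annotation = pd.DataFrame(
    {
        "gene_ids": ["G1", "G2", "G2"],
        "chromosome": ["1", "2", "2"],
        "start": [100, 500, 500],
        "end": [200, 900, 900],
    }
)


def join_positions(
    var: pd.DataFrame, annotation: pd.DataFrame, index_key: str = "gene_ids"
) -> pd.DataFrame:
    """Left-join annotation columns onto var, keeping var's rows and index."""
    # Keep one row per key so the join cannot multiply rows of var.
    lookup = annotation.drop_duplicates(subset=index_key).set_index(index_key)
    # join(on=...) matches var[index_key] against lookup's index;
    # features without a match get NaN in the new columns.
    return var.join(lookup, on=index_key)


annotated = join_positions(var, annotation)
print(annotated)
```

Note that unmatched features end up as NaN here rather than None, and integer columns such as `start` are upcast to float when any value is missing.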

grst · 2024-10-04T09:17:12Z

very nice 🤩

grst added the enhancement label on Oct 4, 2024

grst mentioned this issue on Oct 4, 2024: gtfparse dependency causes issues with latest numpy, pandas and pyarrow #143 (Open)
