@rhngla
Last active January 22, 2024 22:40
Using gget to search for Ensembl IDs of gene names
# Some gene names have aliases; those can be found using gget.
import gget
import pandas as pd
gene_list = ['1190002N15Rik', '1700019L22Rik', '1700086L19Rik', '2610100L16Rik', '2900055J20Rik', '2900092D14Rik', '3110035E14Rik', '5031426D15Rik', '5930412G12Rik', '6330403A02Rik', '6430573F11Rik', 'A230070E04Rik', 'A330050F15Rik', 'AF529169', 'Ak3l2-ps', 'Apitd1', 'BC030499', 'BC048546', 'Bai2', 'Cecr6', 'Ctgf', 'Cxx1a', 'E130012A19Rik', 'E530001K10Rik', 'Epb4.1', 'Epb4.1l4a', 'Fam150b', 'Fam196a', 'Fam19a1', 'Fam19a2', 'Fam19a5', 'Fam212b', 'Fam46a', 'Fam84a', 'Fam84b', 'Gad1-ps', 'Gpr123', 'Gpr125', 'Gpr126', 'Gpr133', 'Gucy1a3', 'Hrasls', 'Hyi', 'I830012O16Rik', 'LOC100503338', 'LOC101056001', 'LOC102632463', 'LOC102633357', 'LOC102633724', 'LOC102634132', 'LOC102634502', 'LOC102635502', 'LOC102636041', 'LOC102636700', 'LOC102638670', 'LOC102638890', 'LOC102640573', 'LOC102643175', 'LOC105242578', 'LOC105242710', 'LOC105242740', 'LOC105243130', 'LOC105243282', 'LOC105243425', 'LOC105243542', 'LOC105244000', 'LOC105244162', 'LOC105244192', 'LOC105244376', 'LOC105245190', 'LOC105245295', 'LOC105245487', 'LOC105245838', 'LOC105245960', 'LOC105246064', 'LOC105246327', 'LOC105246694', 'LOC105246811', 'LOC105246832', 'LOC105247131', 'Lphn2', 'Lppr1', 'Lppr3', 'Mfsd4', 'N28178', 'Nov', 'Obfc1', 'Pcnxl2', 'Ppapdc1a', 'Ptchd2', 'Pvrl1', 'Pvrl3', 'Stmn1-rs2', 'Tcrb', 'Wbscr17']
# Note: this is slow for large gene lists. This naïve implementation takes ~4 minutes to run.
df = gget.search(gene_list, "mus_musculus")
# function to check the synonym field: returns the first synonym that is in l, else None
def check_synonym(synonym, l):
    for item in synonym:
        if item in l:
            return item
    return None
df['found'] = df['synonym'].apply(lambda x: check_synonym(x, gene_list))
df['keep'] = df['found'].apply(lambda x: x is not None)
# keep gget response if gene is found either in gene name or in synonyms
df_ = df[(df['gene_name'].isin(gene_list)) | (df['keep'])]
df_ = df_.drop(columns=['keep'])
df_['found'] = df_['found'].fillna(df_['gene_name'])
df_.reset_index(drop=True, inplace=True)
# use the original list to create a dataframe. Empty rows indicate gene names for which gget did not find records.
df_keep = pd.DataFrame(gene_list, columns=['gene_list'])
df_keep = df_keep.merge(df_, how='outer', left_on='gene_list', right_on='found')
# save results to file. (included in this gist)
df_keep.to_csv('01-gene-ensembl.csv')
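The matching step above can be tried without calling gget. The following is a minimal sketch in plain Python, not the script's own code: `match_query` and `resolve` are hypothetical helper names, the first three records are abridged from the results table in this gist, and the last record is a made-up placeholder standing in for an unrelated search hit.

```python
# Toy records shaped like rows of the search result (gene_name + synonym list).
# The first three are abridged from this gist's results table; the last one is
# a made-up placeholder for a hit that matches none of the queried names.
records = [
    {"gene_name": "Dipk2a", "synonym": ["1190002N15Rik"]},
    {"gene_name": "Hyi", "synonym": ["2700033B16Rik", "6430559E15Rik"]},
    {"gene_name": "Ccn2", "synonym": ["Ccn2", "Ctgf", "Fisp12", "Hcs24"]},
    {"gene_name": "PlaceholderGene", "synonym": [None]},
]
queries = ["1190002N15Rik", "Ctgf", "Hyi", "Tcrb"]


def match_query(record, query_set):
    """Return the queried name this record answers, or None.

    Same order as the script: synonyms are checked first, then the gene name.
    """
    for alias in record["synonym"] or []:
        if alias in query_set:
            return alias
    if record["gene_name"] in query_set:
        return record["gene_name"]
    return None


def resolve(queries, records):
    """Map every queried name to the gene names found for it (possibly none)."""
    query_set = set(queries)
    hits = {q: [] for q in queries}
    for rec in records:
        found = match_query(rec, query_set)
        if found is not None:
            hits[found].append(rec["gene_name"])
    return hits


result = resolve(queries, records)
for query, genes in result.items():
    print(query, "->", genes if genes else "no record")
```

Starting from the full query list, as the script does with its final merge, keeps names without any record visible in the output instead of silently dropping them.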
gene_list ensembl_id gene_name ensembl_description ext_ref_description biotype synonym url synonym_match found
0 1190002N15Rik ENSMUSG00000045414 Dipk2a divergent protein kinase domain 2A [Source:MGI Symbol;Acc:MGI:1916111] divergent protein kinase domain 2A protein_coding ['1190002N15Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000045414 True 1190002N15Rik
1 1700019L22Rik
2 1700086L19Rik ENSMUSG00000071265 1700086L19Rik RIKEN cDNA 1700086L19 gene [Source:MGI Symbol;Acc:MGI:1921534] RIKEN cDNA 1700086L19 gene transcribed_unitary_pseudogene [None] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000071265 False 1700086L19Rik
3 1700086L19Rik ENSMUSG00000121877 1700086L19Rik RIKEN cDNA 1700086L19 gene [Source:NCBI gene (formerly Entrezgene);Acc:74284] RIKEN cDNA 1700086L19 gene lncRNA ['1700006L05Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000121877 False 1700086L19Rik
4 2610100L16Rik ENSMUSG00000100252 Mir124-2hg Mir124-2 host gene (non-protein coding) [Source:MGI Symbol;Acc:MGI:1917691] Mir124-2 host gene (non-protein coding) lncRNA ['2610100L16Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000100252 True 2610100L16Rik
5 2900055J20Rik ENSMUSG00000051401 Kctd16 potassium channel tetramerisation domain containing 16 [Source:MGI Symbol;Acc:MGI:1914659] potassium channel tetramerisation domain containing 16 protein_coding ['2900055J20Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000051401 True 2900055J20Rik
6 2900092D14Rik
7 3110035E14Rik ENSMUSG00000067879 Vxn vexin [Source:MGI Symbol;Acc:MGI:1924232] vexin protein_coding ['3110035E14Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000067879 True 3110035E14Rik
8 5031426D15Rik
9 5930412G12Rik ENSMUSG00000072591 Fzd10os frizzled class receptor 10, opposite strand [Source:MGI Symbol;Acc:MGI:2442398] frizzled class receptor 10, opposite strand lncRNA ['5930412G12Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000072591 True 5930412G12Rik
10 6330403A02Rik ENSMUSG00000053963 Stum mechanosensory transduction mediator [Source:MGI Symbol;Acc:MGI:2138735] mechanosensory transduction mediator protein_coding ['6330403A02Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000053963 True 6330403A02Rik
11 6430573F11Rik ENSMUSG00000039620 Trmt9b tRNA methyltransferase 9B [Source:MGI Symbol;Acc:MGI:2442328] tRNA methyltransferase 9B protein_coding ['6430573F11Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000039620 True 6430573F11Rik
12 A230070E04Rik
13 A330050F15Rik ENSMUSG00000091636 Akain1 A kinase (PRKA) anchor inhibitor 1 [Source:MGI Symbol;Acc:MGI:2444600] A kinase (PRKA) anchor inhibitor 1 protein_coding ['A330050F15Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000091636 True A330050F15Rik
14 AF529169 ENSMUSG00000039313 Minar1 membrane integral NOTCH2 associated receptor 1 [Source:MGI Symbol;Acc:MGI:2667167] membrane integral NOTCH2 associated receptor 1 protein_coding ['AF529169'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000039313 True AF529169
15 Ak3l2-ps ENSMUSG00000081957 Ak3l2-ps adenylate kinase 3-like 2, pseudogene [Source:MGI Symbol;Acc:MGI:3574349] adenylate kinase 3-like 2, pseudogene processed_pseudogene ['Ak3l2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000081957 False Ak3l2-ps
16 Apitd1 ENSMUSG00000073705 Cenps centromere protein S [Source:MGI Symbol;Acc:MGI:1917178] centromere protein S protein_coding ['Apitd1'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000073705 True Apitd1
17 BC030499 ENSMUSG00000037593 Rskr ribosomal protein S6 kinase related [Source:MGI Symbol;Acc:MGI:2652869] ribosomal protein S6 kinase related protein_coding ['BC030499'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000037593 True BC030499
18 BC048546 ENSMUSG00000047228 A2ml1 alpha-2-macroglobulin like 1 [Source:MGI Symbol;Acc:MGI:3039594] alpha-2-macroglobulin like 1 protein_coding ['BC048546'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000047228 True BC048546
19 Bai2 ENSMUSG00000028782 Adgrb2 adhesion G protein-coupled receptor B2 [Source:MGI Symbol;Acc:MGI:2451244] adhesion G protein-coupled receptor B2 protein_coding ['Bai2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000028782 True Bai2
20 Cecr6 ENSMUSG00000094626 Tmem121b transmembrane protein 121B [Source:MGI Symbol;Acc:MGI:2136977] transmembrane protein 121B protein_coding ['Cecr6'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000094626 True Cecr6
21 Ctgf ENSMUSG00000019997 Ccn2 cellular communication network factor 2 [Source:MGI Symbol;Acc:MGI:95537] cellular communication network factor 2 protein_coding ['Ccn2', 'Ctgf', 'Fisp12', 'Hcs24', 'hypertrophic chondrocyte-specific gene product 24'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000019997 True Ctgf
22 Cxx1a ENSMUSG00000067925 Rtl8a retrotransposon Gag like 8A [Source:MGI Symbol;Acc:MGI:1913408] retrotransposon Gag like 8A protein_coding ['Cxx1a'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000067925 True Cxx1a
23 E130012A19Rik ENSMUSG00000043439 Epop elongin BC and polycomb repressive complex 2 associated protein [Source:MGI Symbol;Acc:MGI:2143991] elongin BC and polycomb repressive complex 2 associated protein protein_coding ['E130012A19Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000043439 True E130012A19Rik
24 E530001K10Rik ENSMUSG00000075020 Mir670hg MIR670 host gene (non-protein coding) [Source:MGI Symbol;Acc:MGI:3041234] MIR670 host gene (non-protein coding) lncRNA ['E530001K10Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000075020 True E530001K10Rik
25 Epb4.1 ENSMUSG00000028906 Epb41 erythrocyte membrane protein band 4.1 [Source:MGI Symbol;Acc:MGI:95401] erythrocyte membrane protein band 4.1 protein_coding ['Epb4.1'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000028906 True Epb4.1
26 Epb4.1l4a ENSMUSG00000024376 Epb41l4a erythrocyte membrane protein band 4.1 like 4a [Source:MGI Symbol;Acc:MGI:103007] erythrocyte membrane protein band 4.1 like 4a protein_coding ['Epb4.1l4a'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000024376 True Epb4.1l4a
27 Fam150b ENSMUSG00000054204 Alkal2 ALK and LTK ligand 2 [Source:MGI Symbol;Acc:MGI:3697448] ALK and LTK ligand 2 protein_coding ['6230419C23Rik', 'Augalpha', 'Augmentor alpha', 'Fam150b'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000054204 True Fam150b
28 Fam196a ENSMUSG00000073805 Insyn2a inhibitory synaptic factor 2A [Source:MGI Symbol;Acc:MGI:3605068] inhibitory synaptic factor 2A protein_coding ['Fam196a'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000073805 True Fam196a
29 Fam19a1 ENSMUSG00000059187 Tafa1 TAFA chemokine like family member 1 [Source:MGI Symbol;Acc:MGI:2443695] TAFA chemokine like family member 1 protein_coding ['C630007B19Rik', 'Fam19a1', 'Tafa-1'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000059187 True Fam19a1
30 Fam19a2 ENSMUSG00000044071 Tafa2 TAFA chemokine like family member 2 [Source:MGI Symbol;Acc:MGI:2143691] TAFA chemokine like family member 2 protein_coding ['Fam19a2', 'Sam2', 'Tafa-2', 'Tafa2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000044071 True Fam19a2
31 Fam19a5 ENSMUSG00000054863 Tafa5 TAFA chemokine like family member 5 [Source:MGI Symbol;Acc:MGI:2146182] TAFA chemokine like family member 5 protein_coding ['Fam19a5', 'Tara-5'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000054863 True Fam19a5
32 Fam212b ENSMUSG00000048458 Inka2 inka box actin regulator 2 [Source:MGI Symbol;Acc:MGI:1923497] inka box actin regulator 2 protein_coding ['6530418L21Rik', 'Fam212b', 'Inka2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000048458 True Fam212b
33 Fam46a ENSMUSG00000032265 Tent5a terminal nucleotidyltransferase 5A [Source:MGI Symbol;Acc:MGI:2670964] terminal nucleotidyltransferase 5A protein_coding ['BAP014', 'D930050G01Rik', 'Fam46a'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000032265 True Fam46a
34 Fam84a ENSMUSG00000020607 Lratd1 LRAT domain containing 1 [Source:MGI Symbol;Acc:MGI:2145011] LRAT domain containing 1 protein_coding ['2310003N02Rik', '4731402F03Rik', 'Fam84a'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000020607 True Fam84a
35 Fam84b ENSMUSG00000072568 Lratd2 LRAT domain containing 1 [Source:MGI Symbol;Acc:MGI:3026924] LRAT domain containing 1 protein_coding ['D330050I23Rik', 'Fam84b'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000072568 True Fam84b
36 Gad1-ps ENSMUSG00000090665 Gad1-ps glutamate decarboxylase 1, pseudogene [Source:MGI Symbol;Acc:MGI:95633] glutamate decarboxylase 1, pseudogene processed_pseudogene ['Gad-1ps'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000090665 False Gad1-ps
37 Gpr123 ENSMUSG00000025475 Adgra1 adhesion G protein-coupled receptor A1 [Source:MGI Symbol;Acc:MGI:1277167] adhesion G protein-coupled receptor A1 protein_coding ['Gpr123'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000025475 True Gpr123
38 Gpr125 ENSMUSG00000029090 Adgra3 adhesion G protein-coupled receptor A3 [Source:MGI Symbol;Acc:MGI:1917943] adhesion G protein-coupled receptor A3 protein_coding ['Gpr125'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000029090 True Gpr125
39 Gpr126 ENSMUSG00000039116 Adgrg6 adhesion G protein-coupled receptor G6 [Source:MGI Symbol;Acc:MGI:1916151] adhesion G protein-coupled receptor G6 protein_coding ['Gpr126'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000039116 True Gpr126
40 Gpr133 ENSMUSG00000044017 Adgrd1 adhesion G protein-coupled receptor D1 [Source:MGI Symbol;Acc:MGI:3041203] adhesion G protein-coupled receptor D1 protein_coding ['Gpr133'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000044017 True Gpr133
41 Gucy1a3 ENSMUSG00000033910 Gucy1a1 guanylate cyclase 1, soluble, alpha 1 [Source:MGI Symbol;Acc:MGI:1926562] guanylate cyclase 1, soluble, alpha 1 protein_coding ['Gucy1a3'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000033910 True Gucy1a3
42 Hrasls ENSMUSG00000022525 Plaat1 phospholipase A and acyltransferase 1 [Source:MGI Symbol;Acc:MGI:1351473] phospholipase A and acyltransferase 1 protein_coding ['2810012B06Rik', 'A-C1', 'Hrasls'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000022525 True Hrasls
43 Hyi ENSMUSG00000006395 Hyi hydroxypyruvate isomerase (putative) [Source:MGI Symbol;Acc:MGI:1915430] hydroxypyruvate isomerase (putative) protein_coding ['2700033B16Rik', '6430559E15Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000006395 False Hyi
44 I830012O16Rik ENSMUSG00000062488 Ifit3b interferon-induced protein with tetratricopeptide repeats 3B [Source:MGI Symbol;Acc:MGI:3698419] interferon-induced protein with tetratricopeptide repeats 3B protein_coding ['I830012O16Rik'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000062488 True I830012O16Rik
45 LOC100503338
46 LOC101056001
47 LOC102632463
48 LOC102633357
49 LOC102633724
50 LOC102634132
51 LOC102634502
52 LOC102635502
53 LOC102636041
54 LOC102636700
55 LOC102638670
56 LOC102638890
57 LOC102640573
58 LOC102643175
59 LOC105242578
60 LOC105242710
61 LOC105242740
62 LOC105243130
63 LOC105243282
64 LOC105243425
65 LOC105243542
66 LOC105244000
67 LOC105244162
68 LOC105244192
69 LOC105244376
70 LOC105245190
71 LOC105245295
72 LOC105245487
73 LOC105245838
74 LOC105245960
75 LOC105246064
76 LOC105246327
77 LOC105246694
78 LOC105246811
79 LOC105246832
80 LOC105247131
81 Lphn2 ENSMUSG00000028184 Adgrl2 adhesion G protein-coupled receptor L2 [Source:MGI Symbol;Acc:MGI:2139714] adhesion G protein-coupled receptor L2 protein_coding ['Lphn2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000028184 True Lphn2
82 Lppr1 ENSMUSG00000063446 Plppr1 phospholipid phosphatase related 1 [Source:MGI Symbol;Acc:MGI:2445015] phospholipid phosphatase related 1 protein_coding ['E130309F12Rik', 'Lppr1', 'PRG-3'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000063446 True Lppr1
83 Lppr3 ENSMUSG00000035835 Plppr3 phospholipid phosphatase related 3 [Source:MGI Symbol;Acc:MGI:2388640] phospholipid phosphatase related 3 protein_coding ['BC005764', 'Lppr3'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000035835 True Lppr3
84 Mfsd4 ENSMUSG00000059149 Mfsd4a major facilitator superfamily domain containing 4A [Source:MGI Symbol;Acc:MGI:2442786] major facilitator superfamily domain containing 4A protein_coding ['A930031D07Rik', 'Mfsd4'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000059149 True Mfsd4
85 N28178 ENSMUSG00000036062 Phf24 PHD finger protein 24 [Source:MGI Symbol;Acc:MGI:2140712] PHD finger protein 24 protein_coding ['N28178'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000036062 True N28178
86 Nov ENSMUSG00000037362 Ccn3 cellular communication network factor 3 [Source:MGI Symbol;Acc:MGI:109185] cellular communication network factor 3 protein_coding ['C130088N23Rik', 'CCN3', 'Nov'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000037362 True Nov
87 Obfc1 ENSMUSG00000042694 Stn1 STN1, CST complex subunit [Source:MGI Symbol;Acc:MGI:1915581] STN1, CST complex subunit protein_coding ['Obfc1'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000042694 True Obfc1
88 Pcnxl2 ENSMUSG00000060212 Pcnx2 pecanex homolog 2 [Source:MGI Symbol;Acc:MGI:2445010] pecanex homolog 2 protein_coding ['Pcnxl2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000060212 True Pcnxl2
89 Ppapdc1a ENSMUSG00000070366 Plpp4 phospholipid phosphatase 4 [Source:MGI Symbol;Acc:MGI:2685936] phospholipid phosphatase 4 protein_coding ['Ppapdc1a'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000070366 True Ppapdc1a
90 Ptchd2 ENSMUSG00000041544 Disp3 dispatched RND transporter family member 3 [Source:MGI Symbol;Acc:MGI:2444403] dispatched RND transporter family member 3 protein_coding ['Ptchd2'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000041544 True Ptchd2
91 Pvrl1 ENSMUSG00000032012 Nectin1 nectin cell adhesion molecule 1 [Source:MGI Symbol;Acc:MGI:1926483] nectin cell adhesion molecule 1 protein_coding ['Pvrl1'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000032012 True Pvrl1
92 Pvrl3 ENSMUSG00000022656 Nectin3 nectin cell adhesion molecule 3 [Source:MGI Symbol;Acc:MGI:1930171] nectin cell adhesion molecule 3 protein_coding ['Pvrl3'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000022656 True Pvrl3
93 Stmn1-rs2
94 Tcrb
95 Wbscr17 ENSMUSG00000034040 Galnt17 polypeptide N-acetylgalactosaminyltransferase 17 [Source:MGI Symbol;Acc:MGI:2137594] polypeptide N-acetylgalactosaminyltransferase 17 protein_coding ['Wbscr17'] https://useast.ensembl.org/mus_musculus/Gene/Summary?g=ENSMUSG00000034040 True Wbscr17
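The table can be summarized with the standard library alone. The sketch below is one possible way to do it, not part of the original gist: it inlines an abridged excerpt (three columns, six rows taken from the table above) so it runs on its own, and the variable names are illustrative. To run it on the saved file, open `01-gene-ensembl.csv` instead of the inlined string; the file written by `to_csv` should also carry a leading unnamed index column, which `csv.DictReader` exposes under the key `''` and which this sketch would ignore.

```python
import csv
import io
from collections import Counter

# Abridged excerpt of the results table, inlined so the sketch is self-contained.
excerpt = """gene_list,ensembl_id,gene_name
1190002N15Rik,ENSMUSG00000045414,Dipk2a
1700019L22Rik,,
1700086L19Rik,ENSMUSG00000071265,1700086L19Rik
1700086L19Rik,ENSMUSG00000121877,1700086L19Rik
Tcrb,,
Wbscr17,ENSMUSG00000034040,Galnt17
"""

rows = list(csv.DictReader(io.StringIO(excerpt)))

# Queried names whose row has an empty ensembl_id: no record was found.
unmatched = [r["gene_list"] for r in rows if not r["ensembl_id"]]

# Queried names that occur on more than one row: several Ensembl records matched.
counts = Counter(r["gene_list"] for r in rows if r["ensembl_id"])
ambiguous = [name for name, n in counts.items() if n > 1]

# Queried names whose current gene symbol differs from the name searched for.
renamed = {
    r["gene_list"]: r["gene_name"]
    for r in rows
    if r["ensembl_id"] and r["gene_name"] != r["gene_list"]
}

print("unmatched:", unmatched)
print("ambiguous:", ambiguous)
print("renamed:", renamed)
```

Names listed as ambiguous, such as `1700086L19Rik` here, map to more than one Ensembl ID and probably need a manual choice before the IDs are used downstream.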