PubChem_SDQ_Bibliometrics

Count references in PubChem associated with a CID (e.g., from PubMed, Patent, Springer Nature, Thieme, and Wiley Collections)
% Vincent F. Scalfani, Serena C. Ralph, Ali Al Alshaikh, and Jason E. Bara
% The University of Alabama
% Tested with MATLAB R2020a, running Ubuntu 18.04 on March 30, 2020.
% PubChem SDQ is used internally by PubChem webpages and is still being rapidly developed.

Define the PubChem API and SDQ Agent Base URLs

% PubChem API
api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/';
% PubChem SDQ agent
sdq = 'https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?outfmt=json&query=';
% set a longer web options timeout
% this prevents a MATLAB timeout if the server is slow to respond.
options_api = weboptions('Timeout', 30);
options_sdq = weboptions('Timeout', 60,'ContentType','json');
% Retrieve and display a PNG image of 1-Butyl-3-methyl-imidazolium; CID 2734162
CID_SS_query = '2734162';
Replace the above CID value (CID_SS_query) with a different CID number to customize.
CID_url = [api 'cid/' CID_SS_query '/PNG'];
[CID_img,map] = imread(CID_url);
imshow(CID_img,map)

Perform a Similarity Search

% Search for chemical structures by Similarity Search (SS),
% (2D Tanimoto threshold 95% to 1-Butyl-3-methyl-imidazolium; CID 2734162)
api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/';
SS_url = [api 'fastsimilarity_2d/cid/' CID_SS_query '/cids/JSON?Threshold=95'];
SS_CIDs = webread(SS_url,options_api);
SS_CIDs = num2cell(SS_CIDs.IdentifierList.CID)
SS_CIDs = 249×1 cell (first 100 rows shown)
  1  12971008
  2  304622
  3  61347
  4  11448496
  5  11424151
  6  11171745
  7  2734161
  8  118785
  9  2734236
 10  2734162
 11  529334
 12  11788435
 13  11245926
 14  11160028
 15  5245884
 16  2734168
 17  91210418
 18  87560886
 19  87559770
 20  87106874
 21  24766551
 22  17870330
 23  16720567
 24  15557008
 25  15255204
 26  12392681
 27  12392676
 28  11448364
 29  11277167
 30  11031767
 31  10608883
 32  10537570
 33  10513048
 34  10313448
 35  10313447
 36  10154187
 37  4183883
 38  139254006
 39  134345956
 40  122625623
 41  121299516
 42  118952202
 43  118057427
 44  117890836
 45  117703152
 46  117684660
 47  102147231
 48  90912888
 49  89713026
 50  89678233
 51  89432682
 52  88864524
 53  88236103
 54  87942618
 55  87806569
 56  87790333
 57  87789992
 58  87789923
 59  87789740
 60  87754289
 61  87754264
 62  87690425
 63  87688227
 64  87572548
 65  87572214
 66  87572213
 67  87509019
 68  87397668
 69  87388314
 70  87325711
 71  87308565
 72  87222859
 73  87181405
 74  87181202
 75  87181050
 76  87173651
 77  87125511
 78  87125508
 79  87121545
 80  87121544
 81  87121543
 82  87121443
 83  87121324
 84  87121318
 85  87121317
 86  87121316
 87  87121297
 88  87121296
 89  87121295
 90  87105369
 91  87099925
 92  87096071
 93  87092336
 94  69317070
 95  68379078
 96  67674484
 97  66751376
 98  60860613
 99  60103428
100  59872702
To use a different 2D Tanimoto threshold, change the Threshold value at the end of SS_url above (e.g., Threshold=90).
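As a minimal sketch of that adjustment, the threshold can be kept in a variable and spliced into the URL. The names threshold and SS_url_custom are illustrative and are not part of the original script; api and CID_SS_query are repeated here only so the snippet runs on its own.

```matlab
% base URL and query CID, repeated from above so the snippet is self-contained
api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/';
CID_SS_query = '2734162';
% threshold (hypothetical variable): 2D Tanimoto threshold in percent
threshold = 90;
% build the similarity search URL with the chosen threshold
SS_url_custom = [api 'fastsimilarity_2d/cid/' CID_SS_query ...
    '/cids/JSON?Threshold=' num2str(threshold)];
disp(SS_url_custom)
```

Passing SS_url_custom to webread in place of SS_url should then return the CIDs at the new threshold.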
% set a CID limit to 25 max
number_SS_CIDs = length(SS_CIDs)
number_SS_CIDs = 249
The limit of 25 CIDs keeps initial test runs short, since each CID requires two web requests plus pauses. Increase the limit as needed.
if number_SS_CIDs > 25
SS_CIDs = SS_CIDs(1:25)
else
disp('Number of SS_CIDs not changed')
end
SS_CIDs = 25×1 cell
  1  12971008
  2  304622
  3  61347
  4  11448496
  5  11424151
  6  11171745
  7  2734161
  8  118785
  9  2734236
 10  2734162
 11  529334
 12  11788435
 13  11245926
 14  11160028
 15  5245884
 16  2734168
 17  91210418
 18  87560886
 19  87559770
 20  87106874
 21  24766551
 22  17870330
 23  16720567
 24  15557008
 25  15255204
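The same cap can be written without an if/else branch. This is a sketch on a stand-in cell array; max_CIDs and demo_CIDs are hypothetical names, not variables from the script.

```matlab
% max_CIDs (hypothetical): the largest number of CIDs to keep
max_CIDs = 25;
% stand-in for SS_CIDs: a 30×1 cell array of made-up identifiers
demo_CIDs = num2cell((1:30)');
% keep at most max_CIDs rows; shorter lists are left unchanged
demo_CIDs = demo_CIDs(1:min(max_CIDs, numel(demo_CIDs)));
disp(numel(demo_CIDs))
```

Keeping the limit in one variable means it only has to be changed in one place when the run is scaled up.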

Retrieve Isomeric SMILES for CIDs, Number of Substances, and Literature

% set up a for loop that processes each CID one by one.
for j = 1:length(SS_CIDs)
CID = SS_CIDs{j};
% define api call
api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/';
% define api call for isomeric SMILES
CID_IsoSMILES_url = [api 'cid/' num2str(CID) '/property/IsomericSMILES/TXT'];
% retrieve isomeric SMILES
try
CID_IsoSMILES = webread(CID_IsoSMILES_url,options_api);
catch ME
CID_IsoSMILES = 'not found'
end
n = 0.5;
pause(n)
% add isomeric SMILES data to SS_CIDs data array
% j increases by 1 on each iteration, so the first CID Isomeric SMILES
% gets added to {1,2}, the second to {2,2}, third to {3,2}, etc.
SS_CIDs{j,2} = CID_IsoSMILES;
% define sdq call to retrieve count data
sdq = 'https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?outfmt=json&query=';
litCountQ_url = [sdq '{"hide":"*","collection":"*","where":{"ands":{"cid":"' num2str(CID) '"}}}'];
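try
litCountQ = webread(litCountQ_url, options_sdq);
catch ME
% request failed: mark the count and label columns for this CID,
% then skip to the next CID so the indexing below does not error
SS_CIDs(j,3:14) = {'not found'};
pause(1)
continue
end
n = 1;
pause(n)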
% add selected collection count data to the SS_CIDs cell array.
% the index on the left (e.g., {j,3}) is the column of SS_CIDs in which
% the value is stored (column 3); the index on the right
% (e.g., litCountQ.SDQOutputSet{2,1}) is the row of the
% litCountQ.SDQOutputSet cell array that holds the collection
% (row 2 is the substance collection).
% substance
SS_CIDs{j,3} = litCountQ.SDQOutputSet{2,1}.totalCount;
% add the associated collection row value as a manual validation check
SS_CIDs{j,4} = litCountQ.SDQOutputSet{2,1}.collection;
% patent
SS_CIDs{j,5} = litCountQ.SDQOutputSet{4,1}.totalCount;
SS_CIDs{j,6} = litCountQ.SDQOutputSet{4,1}.collection;
% pubmed
SS_CIDs{j,7} = litCountQ.SDQOutputSet{7,1}.totalCount;
SS_CIDs{j,8} = litCountQ.SDQOutputSet{7,1}.collection;
% thiemechemistry
SS_CIDs{j,9} = litCountQ.SDQOutputSet{13,1}.totalCount;
SS_CIDs{j,10} = litCountQ.SDQOutputSet{13,1}.collection;
% springernature
SS_CIDs{j,11} = litCountQ.SDQOutputSet{15,1}.totalCount;
SS_CIDs{j,12} = litCountQ.SDQOutputSet{15,1}.collection;
% wiley
SS_CIDs{j,13} = litCountQ.SDQOutputSet{14,1}.totalCount;
SS_CIDs{j,14} = litCountQ.SDQOutputSet{14,1}.collection;
end
To add counts from other collections, first list the collections returned by the SDQ query:
% display all available collections
collections = cell(1,length(litCountQ.SDQOutputSet)); % preallocate
for k = 1:length(litCountQ.SDQOutputSet)
collections{k} = litCountQ.SDQOutputSet{k,1}.collection;
end
collections = collections'
collections = 33×1 cell (first 10 rows shown)
'compound'
'substance'
'assay'
'patent'
'pathway'
'disease'
'pubmed'
'targetprotein'
'targetgene'
'targettaxonomy'
  1. Choose the collection you are interested in (e.g., 'clinicaltrials') from the litCountQ.SDQOutputSet structure.
  2. Record the row number in which the collection appears (18 for 'clinicaltrials').
  3. Use this number to index into litCountQ.SDQOutputSet.
  4. For example, litCountQ.SDQOutputSet{18,1}.totalCount returns the count for 'clinicaltrials'.
  5. Add the new bibliometric data count to the for loop above, storing it in a new column of SS_CIDs.
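Row numbers can shift if PubChem adds or reorders collections, so a lookup by collection name may be more robust than a fixed index. The sketch below assumes that litCountQ.SDQOutputSet is an N×1 cell array of structs, each with collection and totalCount fields, as the indexing in the loop above suggests. mockQ, names, row, and patentCount are hypothetical names, and the mock values are illustrative only.

```matlab
% mock response with the same assumed shape as litCountQ
mockQ.SDQOutputSet = { ...
    struct('collection', 'compound',  'totalCount', 1); ...
    struct('collection', 'substance', 'totalCount', 60); ...
    struct('collection', 'patent',    'totalCount', 2263)};
% collect the collection names in row order
names = cellfun(@(s) s.collection, mockQ.SDQOutputSet, 'UniformOutput', false);
% find the row of the wanted collection by name
row = find(strcmp(names, 'patent'), 1);
if isempty(row)
    % collection not present in the response
    patentCount = NaN;
else
    patentCount = mockQ.SDQOutputSet{row,1}.totalCount;
end
disp(patentCount)
```

With a lookup like this, the label columns kept for manual validation become less critical, although they remain a useful cross-check.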

Validate Extracted Counts and Fields

% convert cell array to string and remove leading and trailing white space
SS_CIDs_string = strtrim(string(SS_CIDs));
% convert to table and verify expected counts match extracted collection
% e.g., in the patent_lab column all rows should be 'patent'
SSq_bibtable_validate = array2table(SS_CIDs_string, 'VariableNames',{'CID','Isomeric_SMILES','num_substances',...
'substances_lab','num_patent','patent_lab','num_pubmed','pubmed_lab','num_thiemechemistry',...
'thiemechemistry_lab','num_springernature','springernature_lab','num_wiley','wiley_lab'})
SSq_bibtable_validate = 25×14 table
    CID         Isomeric_SMILES                      num_substances  substances_lab  num_patent  patent_lab  num_pubmed  pubmed_lab  num_thiemechemistry  thiemechemistry_lab  num_springernature  springernature_lab  num_wiley  wiley_lab
 1  "12971008"  "CCCN1C=C[N+](=C1)C.[I-]"            "69"            "substance"     "104"       "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "142"               "springernature"    "5"        "wiley"
 2  "304622"    "CCCCN1C=CN=C1C"                     "51"            "substance"     "126"       "patent"    "7"         "pubmed"    "0"                  "thiemechemistry"    "7"                 "springernature"    "0"        "wiley"
 3  "61347"     "CCCCN1C=CN=C1"                      "103"           "substance"     "1259"      "patent"    "21"        "pubmed"    "13"                 "thiemechemistry"    "116"               "springernature"    "5"        "wiley"
 4  "11448496"  "CCCCN1C=C[N+](=C1)C.[I-]"           "58"            "substance"     "95"        "patent"    "0"         "pubmed"    "2"                  "thiemechemistry"    "133"               "springernature"    "2"        "wiley"
 5  "11424151"  "CCCCN1C=C[N+](=C1)C.C(#N)[S-]"      "49"            "substance"     "143"       "patent"    "2"         "pubmed"    "0"                  "thiemechemistry"    "32"                "springernature"    "1"        "wiley"
 6  "11171745"  "CCCCN1C=C[N+](=C1)C.C(=[N-])=NC#N"  "56"            "substance"     "4"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "77"                "springernature"    "0"        "wiley"
 7  "2734161"   "CCCCN1C=C[N+](=C1)C.[Cl-]"          "113"           "substance"     "772"       "patent"    "323"       "pubmed"    "2"                  "thiemechemistry"    "1182"              "springernature"    "9"        "wiley"
 8  "118785"    "CCCN1C=CN=C1"                       "94"            "substance"     "1263"      "patent"    "3"         "pubmed"    "2"                  "thiemechemistry"    "36"                "springernature"    "0"        "wiley"
 9  "2734236"   "CCCCN1C=C[N+](=C1)C.[Br-]"          "84"            "substance"     "227"       "patent"    "323"       "pubmed"    "1"                  "thiemechemistry"    "380"               "springernature"    "22"       "wiley"
10  "2734162"   "CCCCN1C=C[N+](=C1)C"                "60"            "substance"     "2263"      "patent"    "668"       "pubmed"    "30"                 "thiemechemistry"    "0"                 "springernature"    "1"        "wiley"
11  "529334"    "CCCCCN1C=CN=C1"                     "63"            "substance"     "1934"      "patent"    "1"         "pubmed"    "1"                  "thiemechemistry"    "4"                 "springernature"    "0"        "wiley"
12  "11788435"  "CCCCN1C=C[N+](=C1)C.[OH-]"          "26"            "substance"     "26"        "patent"    "3"         "pubmed"    "3"                  "thiemechemistry"    "117"               "springernature"    "11"       "wiley"
13  "11245926"  "CCCCN1C=C[N+](=C1)C.[Br-].BrBr"     "9"             "substance"     "0"         "patent"    "0"         "pubmed"    "1"                  "thiemechemistry"    "1"                 "springernature"    "0"        "wiley"
14  "11160028"  "CCCN1C=C[N+](=C1)C.[Br-]"           "24"            "substance"     "21"        "patent"    "1"         "pubmed"    "0"                  "thiemechemistry"    "11"                "springernature"    "0"        "wiley"
15  "5245884"   "CCCN1C=C[N+](=C1)C"                 "30"            "substance"     "350"       "patent"    "2"         "pubmed"    "1"                  "thiemechemistry"    "0"                 "springernature"    "1"        "wiley"
16  "2734168"   "CCCCN1C=C[N+](=C1C)C"               "32"            "substance"     "564"       "patent"    "29"        "pubmed"    "2"                  "thiemechemistry"    "0"                 "springernature"    "1"        "wiley"
17  "91210418"  "CCCCN1C=C[N+](=C1I)C"               "2"             "substance"     "1"         "patent"    "1"         "pubmed"    "0"                  "thiemechemistry"    "0"                 "springernature"    "0"        "wiley"
18  "87560886"  "CCCC[N+]1=CN(C=C1)C=C.[Br-]"        "22"            "substance"     "4"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "11"                "springernature"    "2"        "wiley"
19  "87559770"  "CCCC[N+]1=CN(C=C1)C=C.[Cl-]"        "9"             "substance"     "2"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "9"                 "springernature"    "2"        "wiley"
20  "87106874"  "CCCCCN1C=C[N+](=C1)CCCCC"           "3"             "substance"     "21"        "patent"    "1"         "pubmed"    "0"                  "thiemechemistry"    "0"                 "springernature"    "0"        "wiley"
21  "24766551"  "CCCC[N+]1=CN(C=C1)C=C"              "8"             "substance"     "8"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "0"                 "springernature"    "0"        "wiley"
22  "17870330"  "CN(C)CCCN1C=CN=C1"                  "14"            "substance"     "48"        "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "0"                 "springernature"    "0"        "wiley"
23  "16720567"  "CCCCN1C=C[N+](=C1)CCC.[Br-]"        "6"             "substance"     "1"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "1"                 "springernature"    "0"        "wiley"
24  "15557008"  "CCCC1=NC=CN1CC"                     "7"             "substance"     "7"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "2"                 "springernature"    "0"        "wiley"
25  "15255204"  "CCCCN1C=C[N+](=C1)CCCC.[Cl-]"       "4"             "substance"     "6"         "patent"    "0"         "pubmed"    "0"                  "thiemechemistry"    "0"                 "springernature"    "1"        "wiley"
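The label columns can also be checked programmatically rather than by eye. The following is a sketch on a two-row stand-in table; demo_labels, demoT, expected, cols, and all_match are hypothetical names. With the real data, demoT would be replaced by SSq_bibtable_validate and expected would list every *_lab column.

```matlab
% stand-in for two label columns of the validation table (illustrative values)
demo_labels = ["substance" "patent"; "substance" "patent"];
demoT = array2table(demo_labels, 'VariableNames', {'substances_lab','patent_lab'});
% expected label for each *_lab column
expected = struct('substances_lab', "substance", 'patent_lab', "patent");
cols = fieldnames(expected);
all_match = true;
for k = 1:numel(cols)
    % compare every row of the column with its expected label
    if ~all(demoT.(cols{k}) == expected.(cols{k}))
        all_match = false;
        fprintf('Label mismatch in column %s\n', cols{k});
    end
end
disp(all_match)
```

A mismatch would suggest that the hard-coded row index for that collection no longer points at the intended collection.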

Compile Bibliometric Data into a Table

% select the Isomeric SMILES, CID, and count columns to export
% (the label columns used for validation are dropped)
SSq_bibtable = SSq_bibtable_validate(:, {'Isomeric_SMILES' 'CID' 'num_substances' 'num_patent' 'num_pubmed'...
'num_thiemechemistry' 'num_springernature' 'num_wiley'})
SSq_bibtable = 25×8 table
    Isomeric_SMILES                      CID         num_substances  num_patent  num_pubmed  num_thiemechemistry  num_springernature  num_wiley
 1  "CCCN1C=C[N+](=C1)C.[I-]"            "12971008"  "69"            "104"       "0"         "0"                  "142"               "5"
 2  "CCCCN1C=CN=C1C"                     "304622"    "51"            "126"       "7"         "0"                  "7"                 "0"
 3  "CCCCN1C=CN=C1"                      "61347"     "103"           "1259"      "21"        "13"                 "116"               "5"
 4  "CCCCN1C=C[N+](=C1)C.[I-]"           "11448496"  "58"            "95"        "0"         "2"                  "133"               "2"
 5  "CCCCN1C=C[N+](=C1)C.C(#N)[S-]"      "11424151"  "49"            "143"       "2"         "0"                  "32"                "1"
 6  "CCCCN1C=C[N+](=C1)C.C(=[N-])=NC#N"  "11171745"  "56"            "4"         "0"         "0"                  "77"                "0"
 7  "CCCCN1C=C[N+](=C1)C.[Cl-]"          "2734161"   "113"           "772"       "323"       "2"                  "1182"              "9"
 8  "CCCN1C=CN=C1"                       "118785"    "94"            "1263"      "3"         "2"                  "36"                "0"
 9  "CCCCN1C=C[N+](=C1)C.[Br-]"          "2734236"   "84"            "227"       "323"       "1"                  "380"               "22"
10  "CCCCN1C=C[N+](=C1)C"                "2734162"   "60"            "2263"      "668"       "30"                 "0"                 "1"
11  "CCCCCN1C=CN=C1"                     "529334"    "63"            "1934"      "1"         "1"                  "4"                 "0"
12  "CCCCN1C=C[N+](=C1)C.[OH-]"          "11788435"  "26"            "26"        "3"         "3"                  "117"               "11"
13  "CCCCN1C=C[N+](=C1)C.[Br-].BrBr"     "11245926"  "9"             "0"         "0"         "1"                  "1"                 "0"
14  "CCCN1C=C[N+](=C1)C.[Br-]"           "11160028"  "24"            "21"        "1"         "0"                  "11"                "0"
15  "CCCN1C=C[N+](=C1)C"                 "5245884"   "30"            "350"       "2"         "1"                  "0"                 "1"
16  "CCCCN1C=C[N+](=C1C)C"               "2734168"   "32"            "564"       "29"        "2"                  "0"                 "1"
17  "CCCCN1C=C[N+](=C1I)C"               "91210418"  "2"             "1"         "1"         "0"                  "0"                 "0"
18  "CCCC[N+]1=CN(C=C1)C=C.[Br-]"        "87560886"  "22"            "4"         "0"         "0"                  "11"                "2"
19  "CCCC[N+]1=CN(C=C1)C=C.[Cl-]"        "87559770"  "9"             "2"         "0"         "0"                  "9"                 "2"
20  "CCCCCN1C=C[N+](=C1)CCCCC"           "87106874"  "3"             "21"        "1"         "0"                  "0"                 "0"
21  "CCCC[N+]1=CN(C=C1)C=C"              "24766551"  "8"             "8"         "0"         "0"                  "0"                 "0"
22  "CN(C)CCCN1C=CN=C1"                  "17870330"  "14"            "48"        "0"         "0"                  "0"                 "0"
23  "CCCCN1C=C[N+](=C1)CCC.[Br-]"        "16720567"  "6"             "1"         "0"         "0"                  "1"                 "0"
24  "CCCC1=NC=CN1CC"                     "15557008"  "7"             "7"         "0"         "0"                  "2"                 "0"
25  "CCCCN1C=C[N+](=C1)CCCC.[Cl-]"       "15255204"  "4"             "6"         "0"         "0"                  "0"                 "1"
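The count columns in this table are strings, because the cell array was converted with string() before the table was built. If numeric counts are needed for sorting or plotting, they can be converted with str2double. The sketch below uses a two-row stand-in table; demoT and countVars are hypothetical names.

```matlab
% stand-in for two count columns of SSq_bibtable (values copied from rows 1 and 2)
demoT = table(["69"; "51"], ["104"; "126"], ...
    'VariableNames', {'num_substances','num_patent'});
% names of the columns to convert from string to double
countVars = {'num_substances','num_patent'};
for k = 1:numel(countVars)
    % str2double returns NaN for entries that are not numbers (e.g., 'not found')
    demoT.(countVars{k}) = str2double(demoT.(countVars{k}));
end
% sort by patent count, largest first
demoT = sortrows(demoT, 'num_patent', 'descend');
disp(demoT)
```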
% export data as a tab-delimited text file
% prompt user to select folder for data export
save_folder = uigetdir;
% change directory to selected folder
cd(save_folder)
writetable(SSq_bibtable,'MATLAB_SDQ_Bibliometrics_results.txt','Delimiter','tab')
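An alternative that avoids changing the working directory is to build the full output path with fullfile. This is a sketch only: demoT and out_file are hypothetical names, and tempdir stands in for the folder returned by uigetdir so that the snippet runs without a dialog.

```matlab
% stand-in table with two rows of the exported data
demoT = table(["CCCN1C=CN=C1"; "CCCCN1C=CN=C1"], ["118785"; "61347"], ...
    'VariableNames', {'Isomeric_SMILES','CID'});
% tempdir is used here in place of the folder chosen with uigetdir;
% note that uigetdir returns 0 if the dialog is cancelled
save_folder = tempdir;
% build the full path instead of calling cd
out_file = fullfile(save_folder, 'MATLAB_SDQ_Bibliometrics_results.txt');
writetable(demoT, out_file, 'Delimiter', 'tab')
disp(out_file)
```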