<li>I see 167,000 hits from a bunch of Microsoft IPs with reverse DNS “msnbot-” using the Solr query <code>dns:*msnbot* AND dns:*.msn.com</code></li>
<li>I purged these first so I could see the other “real” IPs in the Solr facets</li>
</ul>
</li>
<li>I see 47,500 hits from 80.248.237.167 on a data center ISP in Sweden, using a normal user agent</li>
<li>I see 13,000 hits from 163.237.216.11 on a data center ISP in Australia, using a normal user agent</li>
<li>I see 7,300 hits from 208.185.238.57 from Britannica, using a normal user agent</li>
<li>I see 3,000 hits from 178.208.75.33, a Russian-owned IP in the Netherlands that is making a GET to / every minute, using a normal user agent</li>
<li>I see 3,000 hits from 134.122.124.196 on Digital Ocean to the REST API with a normal user agent</li>
<li>I purged all these hits from IPs for a total of about 265,000</li>
<li>Then I faceted by user agent and found
<ul>
<li>1,000 hits by <code>insomnia/2022.2.1</code>, which I also saw last month and submitted to COUNTER-Robots</li>
<li>265 hits by <code>omgili/0.5 +http://omgili.com</code></li>
<li>150 hits by <code>Vizzit</code></li>
<li>132 hits by <code>MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)</code></li>
<li>73 hits by <code>Scoop.it</code></li>
<li>62 hits by <code>bitdiscovery</code></li>
<li>59 hits by <code>Asana/1.4.0 WebsiteMetadataRetriever</code></li>
<li>32 hits by <code>Sprout Social (Link Attachment)</code></li>
<li>29 hits by <code>CyotekWebCopy/1.9 CyotekHTTP/6.2</code></li>
<li>20 hits by <code>Hootsuite-Authoring/1.0</code></li>
</ul>
</li>
<li>I purged about 4,100 hits from these user agents</li>
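These purges boil down to delete-by-query requests against the Solr statistics core. A minimal sketch of the idea, where the query value is one of the IPs above but the host, port, and the commented-out curl invocation are assumptions about a local setup:

```shell
# Build a Solr delete-by-query payload for one bot IP
# (IP from the list above; host/port below are illustrative)
query='ip:80.248.237.167'
payload="<delete><query>${query}</query></delete>"
echo "$payload"
# Then POST it to the statistics core and commit:
# curl -s 'http://localhost:8983/solr/statistics/update?commit=true' \
#      -H 'Content-Type: text/xml' --data-binary "$payload"
```

The same payload shape works for the user-agent purges by swapping in a <code>userAgent:</code> query instead of <code>ip:</code>.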
<li>Create a user for Mohammed Salem to test MEL submission on DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace user -a -m mel-submit@cgiar.org -g MEL -s Submit -p <span style="color:#e6db74">'owwwwwwww'</span>
</span></span></code></pre></div><ul>
<li>According to my notes from <a href="/cgspace-notes/2020-10/">2020-10</a> the account must be in the admin group in order to submit via the REST API</li>
<li>Francesca asked us to add the CC-BY-3.0-IGO license to the submission form on CGSpace
<ul>
<li>I remember I <a href="https://github.com/spdx/license-list-XML/issues/767">had requested SPDX to add CC-BY-NC-ND-3.0-IGO</a> in 2019-02, and they finally <a href="https://github.com/spdx/license-list-XML/pull/1068">merged it</a> in 2020-07, but I never added it to CGSpace</li>
<li>I will add the full suite of CC 3.0 IGO licenses to CGSpace and then make a request to SPDX for the others:
- CC-BY-3.0-IGO
- CC-BY-SA-3.0-IGO
- CC-BY-ND-3.0-IGO
- CC-BY-NC-3.0-IGO
- CC-BY-NC-SA-3.0-IGO
- CC-BY-NC-ND-3.0-IGO</li>
</ul>
</li>
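For reference, adding these to the submission form means extending the license value-pairs list in DSpace 6's <code>input-forms.xml</code>. A rough sketch only; the <code>value-pairs-name</code> here is a guess and must match whatever list the form's usage-rights field actually references:

```xml
<!-- Hypothetical excerpt: the list name must match the one the
     submission form's license field references -->
<value-pairs value-pairs-name="licenses" dc-term="license">
  <pair>
    <displayed-value>CC-BY-3.0-IGO</displayed-value>
    <stored-value>CC-BY-3.0-IGO</stored-value>
  </pair>
  <!-- ...and likewise for the other five CC 3.0 IGO variants -->
</value-pairs>
```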
<li>I filed <a href="https://github.com/spdx/license-list-XML/issues/1525">an issue asking for SPDX to add CC-BY-3.0-IGO</a></li>
<li>Meeting with Moayad from CodeObia to discuss OpenRXV
<ul>
<li>He added the ability to use multiple indexes / dashboards, and to be able to embed them in iframes</li>
</ul>
</li>
<li>Add <code>cg.contributor.initiative</code> with a controlled vocabulary based on CLARISA’s list to the CGSpace submission form</li>
<li>I am seeing strange results with csvjoin’s <code>--outer</code> join when I need to keep unmatched rows from both the left and right files…
<ul>
<li>Using <code>xsv join --full</code> is giving me better results:</li>
<li>Start working on the CGSpace subject export for FAO</li>
<li>First I exported a list of all metadata in our <code>dcterms.subject</code> and other center-specific subject fields with their counts:</li>
</ul>
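Back to the join issue above: what I want is a full outer join, where unmatched rows from both sides are kept. <code>xsv join --full</code> does that directly; as a sanity check, coreutils <code>join</code> can sketch the same behavior on sorted, header-less toy files (file names and contents here are made up):

```shell
# Toy stand-ins for my subject lists, sorted on the join column
printf 'kenya,50\nnigeria,100\n' > /tmp/subjects.csv
printf 'kenya,yes\nuganda,yes\n' > /tmp/agrovoc.csv
# -a1 -a2: also print unpairable lines from file 1 and file 2,
# which is exactly the "full outer" behavior
join -t, -a1 -a2 /tmp/subjects.csv /tmp/agrovoc.csv
# kenya,50,yes
# nigeria,100
# uganda,yes
```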
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-28-cgspace-subjects.csv WITH CSV HEADER;
</span></span></code></pre></div><ul>
<li>I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!</li>
<li>I think I will have to write some custom script to use the AGROVOC RDF file
<ul>
<li>Using rdflib to open the 1.2GB <code>agrovoc_lod.rdf</code> file takes several minutes and doesn’t seem very efficient</li>
</ul>
</li>
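Since rdflib wants to materialize the whole graph, another option I considered is streaming the RDF/XML with Python's stdlib <code>iterparse</code> and matching labels as they stream by. A sketch of the idea on a toy file standing in for the real 1.2GB <code>agrovoc_lod.rdf</code>:

```shell
# Tiny toy RDF/XML file in the same shape as AGROVOC's dump
cat > /tmp/toy.rdf <<'EOF'
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://example.org/c_331388">
    <skos:prefLabel xml:lang="en">FOREST</skos:prefLabel>
  </rdf:Description>
</rdf:RDF>
EOF
python3 - <<'EOF' | tee /tmp/rdf-match.txt
import xml.etree.ElementTree as ET

PREF_LABEL = '{http://www.w3.org/2004/02/skos/core#}prefLabel'
# iterparse yields elements as their end tags are seen, so memory
# use stays flat even on a huge file if we clear as we go
for event, elem in ET.iterparse('/tmp/toy.rdf'):
    if elem.tag == PREF_LABEL and elem.text == 'FOREST':
        print('match:', elem.text)
    elem.clear()
EOF
```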
<li>I tried using <a href="https://github.com/ozekik/lightrdf">lightrdf</a> and it’s much quicker, but the documentation is limited and I’m not sure how to search yet
<ul>
<li>I had to try in different Python versions because 3.10.x is apparently too new</li>
</ul>
</li>
<li>For future reference I was able to search with lightrdf:</li>
<li>Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
</span></span></code></pre></div><ul>
<li>Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as not having matched…
<ul>
<li>Ah, I was using <code>csvgrep -m 0</code> to find rows that didn’t match, but that also matched items that had 10, 100, 50, etc…</li>
<li>We need to use a regex:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">'number of matches'</span> -r <span style="color:#e6db74">'^0$'</span> /tmp/2022-06-30-cgspace-subjects-results.csv <span style="color:#ae81ff">\</span>
</span></span></code></pre></div><ul>
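The pitfall is worth spelling out: csvgrep’s <code>-m</code> does substring matching, while <code>-r</code> anchors a regex. The same distinction shown with plain grep on a made-up column of match counts:

```shell
# A toy 'number of matches' column
printf '0\n10\n100\n50\n' > /tmp/matches.txt
# Substring match: every one of these lines contains a '0'
grep -c '0' /tmp/matches.txt    # -> 4
# Anchored regex: only the row that is exactly zero
grep -c '^0$' /tmp/matches.txt  # -> 1
```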
<li>Then I took all the terms with fifty or more occurrences and put them on a Google Sheet
<ul>
<li>There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept</li>
<li>pnbecker on DSpace Slack mentioned that they made a JSPUI deduplication step that is open source: <a href="https://github.com/the-library-code/deduplication">https://github.com/the-library-code/deduplication</a>
<ul>
<li>It uses Levenshtein distance via PostgreSQL’s fuzzystrmatch extension</li>