</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 54702
</code></pre></div><ul>
<li>Export donors and affiliations from CGSpace database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
COPY 1036
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
COPY 7901
</code></pre></div><ul>
<li>Then check matches against the latest ROR dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed <span style="color:#e6db74">'1d'</span> > /tmp/2022-02-02-donors.txt
</code></pre></div><ul>
<li>I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump)</li>
<li>I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump)</li>
<li>Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test</li>
<li>Mishell from CIP sent me a copy of a security scan their ICT had done on CGSpace using QualysGuard
<ul>
<li>The report was very long and generic, highlighting low-severity things like being able to post crap to search forms and have it appear on the results page</li>
<li>Also they say we’re using old jQuery and bootstrap, etc (fair enough) but there are no exploits per se</li>
<li>At least now I know why all those Qualys IPs are scanning us all the time!!!</li>
</ul>
</li>
<li>Mishell also said she’s having issues logging into CGSpace
<ul>
<li>According to the logs her account is failing on LDAP authentication</li>
<li>I checked CGSpace’s LDAP credentials using ldapsearch and was able to connect so it’s gotta be something with her account</li>
</ul>
</li>
</ul>
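<ul>
<li>A minimal sketch of how match rates like the ones above could be computed in Python. It is not the script I actually used. It assumes the ROR dump is a JSON array of records with <code>name</code>, <code>aliases</code>, <code>acronyms</code>, and <code>labels</code> keys (the v1 schema), and it only does exact, case-sensitive matching. The function names are hypothetical:</li>
</ul>

```python
import json


def load_ror_names(ror_dump_path):
    """Collect every name, alias, acronym and label from a ROR JSON dump.

    Assumes the dump is a JSON array of records (ROR schema v1).
    """
    with open(ror_dump_path, encoding="utf-8") as f:
        records = json.load(f)

    names = set()
    for record in records:
        names.add(record["name"])
        names.update(record.get("aliases", []))
        names.update(record.get("acronyms", []))
        names.update(label["label"] for label in record.get("labels", []))
    return names


def match_rate(values, ror_names):
    """Return (matched, total, percent) for exact, case-sensitive matches."""
    values = list(values)
    matched = sum(1 for value in values if value in ror_names)
    percent = round(100 * matched / len(values), 1) if values else 0.0
    return matched, len(values), percent
```

<ul>
<li>Feeding the lines of a text file such as <code>/tmp/2022-02-02-donors.txt</code> to <code>match_rate()</code> would give the matched count, the total, and the percentage in one go</li>
</ul>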
<h2 id="2022-02-03">2022-02-03</h2>
<ul>
<li>I synchronized DSpace Test with a fresh snapshot of CGSpace</li>
<li>I noticed a bunch of thumbnails missing for items submitted in the last week on CGSpace so I ran the <code>dspace filter-media</code> script manually and eventually it crashed:</li>
</span><span style="color:#960050;background-color:#1e0010"></span>FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
</code></pre></div><ul>
<li>Meeting with the repositories working group to discuss issues moving forward in the One CGIAR</li>
</ul>
<h2 id="2022-02-07">2022-02-07</h2>
<ul>
<li>Gaia sent me her feedback on the duplicates for the TAC and ICW items for CGSpace a few days ago
<ul>
<li>I used the IDs marked “delete” in her spreadsheet to create a custom text facet with this GREL in OpenRefine:</li>
<li>Then I flagged all of these (seventy-five items)…
<ul>
<li>I decided to flag the deletes instead of starring the keeps because there are some items in the original file that were not marked as duplicates, so we have to keep those too</li>
</ul>
</li>
<li>I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference:</li>
$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
</code></pre></div><ul>
<li>Then I sent this second batch of items to Gaia to look at</li>
</ul>
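<ul>
<li>For reference, a rough Python equivalent of joining two CSVs on an <code>id</code> column, as <code>csvjoin -c id</code> does above. This is only a sketch: it assumes both files have a header row with an <code>id</code> column and unique IDs in the second file, and the helper names are made up:</li>
</ul>

```python
import csv


def read_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def join_on_id(left_rows, right_rows, key="id"):
    """Inner-join two lists of dict rows on a shared key column.

    Rows from the left side without a match on the right are dropped.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined
```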
<h2 id="2022-02-08">2022-02-08</h2>
<ul>
<li>Create a SAF archive for the first 200 items (IDs 1 to 200) that were <em>not</em> flagged as duplicates and upload them to a <a href="https://dspacetest.cgiar.org/handle/10568/117921">new collection on DSpace Test</a>:</li>
<li>Fix some occurrences of “Hammond, Jim” to be “Hammond, James” on CGSpace</li>
<li>Start a full index on AReS</li>
</ul>
<h2 id="2022-02-09">2022-02-09</h2>
<ul>
<li>UptimeRobot said that CGSpace was down yesterday evening, but when I looked it was up and I didn’t see a high database load or anything wrong</li>
<li>Maria from Bioversity wrote to say that CGSpace was very slow also…</li>
</ul>
<h2 id="2022-02-10">2022-02-10</h2>
<ul>
<li>Looking at the Munin graphs on CGSpace I see several metrics showing that there was likely just increased load…</li>
<li>I started resolving them with my <code>resolve-addresses-geoip2.py</code> script</li>
<li>In the meantime I am looking at the requests and I see a new user agent: <code>1science Resolver 1.0.0</code>
<ul>
<li>Seems to be a defunct project from Elsevier (website down, Twitter account inactive since 2020)</li>
</ul>
</li>
<li>I also see 3,400 requests from <code>EyeMonIT_bot_version_0.1_(http://www.eyemon.it/)</code>, but because it has “bot” in the name it gets heavily throttled…
<ul>
<li>I wonder who is monitoring CGSpace with that service…</li>
</ul>
</li>
<li>Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious:</li>
<li>Most of these are pretty normal except “Serverel” and Hetzner perhaps, but their user agents are pretending to be normal users so who knows…</li>
<li>I decided to look in the Solr stats with <code>facet.limit=1000&facet.mincount=1</code> and found a few more definitely non-human agents:
<ul>
<li>scalaj-http/2.4.2</li>
<li>scpitspi-rs</li>
<li>lua-resty-http</li>
<li>AHC/2.1</li>
<li>acebookexternalhit <--- typo, but purge it!!!</li>
<li>CGSpace is slow and the load has been over 400% for a few hours
<ul>
<li>The number of DSpace sessions seems normal, even lower than a few days ago</li>
<li>The number of PostgreSQL connections is low, but I see there are lots of “AccessShare” locks (green on Munin, not blue like usual)</li>
<li>I will run all system updates, copy the latest config changes, and restart the server</li>
</ul>
</li>
</ul>
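<ul>
<li>A small sketch of how the non-human agents listed above could be turned into patterns and counted in a list of user agent strings. This is only an illustration in Python, not the purge tooling used on CGSpace. The patterns are plain substrings taken from the list above, and the names <code>NON_HUMAN_PATTERNS</code> and <code>count_non_human</code> are hypothetical:</li>
</ul>

```python
import re
from collections import Counter

# Patterns taken from the agents listed above; a real purge list would be longer.
NON_HUMAN_PATTERNS = [
    r"scalaj-http",
    r"scpitspi-rs",
    r"lua-resty-http",
    r"AHC/",
    r"acebookexternalhit",
]
NON_HUMAN_RE = re.compile("|".join(NON_HUMAN_PATTERNS))


def count_non_human(user_agents):
    """Count hits per user agent, keeping only agents that match a pattern."""
    return Counter(ua for ua in user_agents if NON_HUMAN_RE.search(ua))
```

<ul>
<li>Note that the <code>acebookexternalhit</code> pattern is a substring match, so it would also catch the correctly spelled <code>facebookexternalhit</code></li>
</ul>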
<h2 id="2022-02-12">2022-02-12</h2>
<ul>
<li>Install PostgreSQL 12 on my local dev environment to start testing DSpace 6.x workflows with it:</li>
<li>Last week Gaia sent me her notes on the second batch of TAC/ICW documents (items 201–400 in the spreadsheet)
<ul>
<li>I created a filter in LibreOffice and selected the IDs for items with the action “delete”, then I created a custom text facet in OpenRefine with this GREL:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>or(
isNotNull(value.match('201')),
isNotNull(value.match('203')),
isNotNull(value.match('209')),
isNotNull(value.match('215')),
isNotNull(value.match('220')),
isNotNull(value.match('225')),
isNotNull(value.match('226')),
isNotNull(value.match('227')),
...
isNotNull(value.match('396'))
)
</code></pre><ul>
<li>Then I flagged all matching records and exported a CSV to use with SAFBuilder
<ul>
<li>Then I imported the SAF bundle on DSpace Test:</li>
# pg_ctlcluster <span style="color:#ae81ff">10</span> main stop
# tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
# tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
# pg_ctlcluster <span style="color:#ae81ff">12</span> main stop
# pg_dropcluster <span style="color:#ae81ff">12</span> main
# pg_upgradecluster <span style="color:#ae81ff">10</span> main
# pg_ctlcluster <span style="color:#ae81ff">12</span> main start
</code></pre></div><ul>
<li>After that I <a href="https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/">re-indexed the database indexes using a query</a>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ su - postgres
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
</code></pre></div><ul>
<li>I saw that the index on <code>metadatavalue</code> shrank by about 200MB!</li>
<li>After testing a few things I dropped the old cluster:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># pg_dropcluster <span style="color:#ae81ff">10</span> main
</code></pre></div><ul>
<li>I updated my <code>migrate-fields.sh</code> script to use field names instead of IDs
<ul>
<li>The script now looks up the appropriate <code>metadata_field_id</code> values for each field in the metadata registry</li>
</ul>
</li>
</ul>
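<ul>
<li>A sketch of what looking up a <code>metadata_field_id</code> by field name could look like. The real <code>migrate-fields.sh</code> is a shell script and is not shown here, so this Python version is only an illustration. It assumes the standard DSpace <code>metadatafieldregistry</code> and <code>metadataschemaregistry</code> tables and psycopg2-style <code>%s</code> placeholders, and the function names are hypothetical:</li>
</ul>

```python
def parse_field_name(field_name):
    """Split 'schema.element[.qualifier]' into its parts."""
    parts = field_name.split(".")
    if len(parts) not in (2, 3):
        raise ValueError(f"Unexpected field name: {field_name}")
    schema, element = parts[0], parts[1]
    qualifier = parts[2] if len(parts) == 3 else None
    return schema, element, qualifier


def build_lookup_query(field_name):
    """Return (sql, params) to look up metadata_field_id for a field name.

    Assumes DSpace's metadatafieldregistry and metadataschemaregistry tables.
    """
    schema, element, qualifier = parse_field_name(field_name)
    sql = (
        "SELECT f.metadata_field_id "
        "FROM metadatafieldregistry f "
        "JOIN metadataschemaregistry s USING (metadata_schema_id) "
        "WHERE s.short_id = %s AND f.element = %s AND "
    )
    # Unqualified fields such as dc.title store NULL in the qualifier column
    if qualifier is None:
        return sql + "f.qualifier IS NULL", (schema, element)
    return sql + "f.qualifier = %s", (schema, element, qualifier)
```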
<h2 id="2022-02-18">2022-02-18</h2>
<ul>
<li>Normalize the <code>text_lang</code> attributes of metadata on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2838588
 en        |    1082
           |     801
 fr        |       2
 vn        |       2
 en_US.    |       1
 sp        |       1
           |       0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', 'en_US.', '');
UPDATE 1884
dspace=# UPDATE metadatavalue SET text_lang='vi' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('vn');
UPDATE 2
dspace=# UPDATE metadatavalue SET text_lang='es' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('sp');
UPDATE 1
</code></pre></div><ul>
<li>I then exported the entire repository and did some cleanup on DOIs
<ul>
<li>I found ~1,200 items with no <code>cg.identifier.doi</code>, but which had a DOI in their citation</li>
<li>I cleaned up and normalized a few hundred others to use <a href="https://doi.org">https://doi.org</a> format</li>
</ul>
</li>
<li>I’m debating using the Crossref API to search for our DOIs and improve our metadata
<ul>
<li>There is good data on publishers, issue dates, volume/issue, and sometimes even licenses</li>
</ul>
</li>
<li>I cleaned up ~1,200 URLs that were using HTTP instead of HTTPS, fixed a bunch of handles, removed some handles from DOI field, etc</li>
</ul>
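<ul>
<li>A minimal sketch of the kind of DOI normalization described above, written in Python. The actual cleanup was done on the exported CSV, so this is only an illustration. It assumes the value is a bare DOI, a <code>doi:</code>-prefixed DOI, or a <code>doi.org</code> / <code>dx.doi.org</code> URL, and <code>normalize_doi</code> is a hypothetical name:</li>
</ul>

```python
import re

# Matches a DOI with or without a resolver URL or "doi:" prefix.
DOI_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def normalize_doi(value):
    """Return the DOI as an https://doi.org/ URL, or None if not DOI-like."""
    match = DOI_RE.match(value.strip())
    if match is None:
        return None
    return "https://doi.org/" + match.group(1)
```

<ul>
<li>Values that are not DOIs, such as handles that ended up in the DOI field, come back as <code>None</code> so they can be reviewed by hand</li>
</ul>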
<h2 id="2022-02-20">2022-02-20</h2>
<ul>
<li>Yesterday I wrote a script to check our DOIs against Crossref’s API and then did some investigation on dates, volumes, issues, pages, and types
<ul>
<li>While investigating issue dates in OpenRefine I created a new column using this GREL to show the number of days between Crossref’s date and ours:</li>
<li>In <em>most</em> cases Crossref’s dates are more correct than ours, though there are a few odd cases where I don’t yet know what strategy I want to use</li>
<li>Start a full harvest on AReS</li>
</ul>
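<ul>
<li>The day-difference comparison described above was done with GREL in OpenRefine. A rough Python equivalent of the idea might look like this. It assumes both dates are ISO-like strings (<code>YYYY</code>, <code>YYYY-MM</code>, or <code>YYYY-MM-DD</code>) and, as a simplification, treats missing months and days as 1. The function names are hypothetical:</li>
</ul>

```python
from datetime import date


def parse_partial_date(value):
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD', defaulting missing parts to 1."""
    parts = [int(part) for part in value.strip().split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unexpected date: {value}")
    year, month, day = (parts + [1, 1])[:3]
    return date(year, month, day)


def days_between(crossref_date, our_date):
    """Days from our date to Crossref's date (negative if Crossref is earlier)."""
    return (parse_partial_date(crossref_date) - parse_partial_date(our_date)).days
```

<ul>
<li>Because partial dates are padded to the first of the month or year, small differences can simply reflect precision rather than a real disagreement</li>
</ul>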
<h2 id="2022-02-21">2022-02-21</h2>
<ul>
<li>I added support for checking the license of DOIs to my Crossref script
<ul>
<li>I exported ~2,800 DOIs and ran a check on them, then merged the CGSpace CSV with the results of the script to inspect in OpenRefine</li>
<li>There are hundreds of DOIs missing licenses in our data, even in this small subset of ~2,800 (out of 19,000 on CGSpace)</li>
<li>I spot checked a few dozen in Crossref’s data and found some incorrect ones, like on Elsevier, Wiley, and Sage journals</li>
<li>I ended up using a series of GREL expressions in OpenRefine to filter out DOIs from these prefixes:</li>
<li>Update some audience metadata on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'Academicians';
UPDATE 354
dspace=# UPDATE metadatavalue SET text_value='Scientists' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'SCIENTISTS';