<li>I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
<ul>
<li>If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!</li>
<li>I ended up finding him via searching for his email address</li>
<li><p>I created a pull request and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</p></li>
<li>The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…</li>
<li>Udana asked why item <a href="https://cgspace.cgiar.org/handle/10568/91278"><sup>10568</sup>⁄<sub>91278</sub></a> didn’t have an Altmetric badge on CGSpace, but on the <a href="https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity">WLE website</a> it does
<ul>
<li>I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all</li>
<li>I tweeted the item and I assume this will link the Handle with the DOI in the system</li>
<li><p>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</p>
<li><p><code>45.5.184.72</code> is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:</p>
<li><p>Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”</p></li>
<li><p>They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):</p>
<li><p>I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover</p></li>
<li><p>Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports</p>
<li>I made a pull request to update this and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/418">#418</a>)</li>
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'</code></pre>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'</code></pre>
<li><p>Making some tests on GET vs HEAD requests on the <a href="https://dspacetest.cgiar.org/handle/10568/100289">CTA Spore 192 item</a> on DSpace Test:</p>
<li><p>After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</p>
<li><p>Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:</p>
<li><p>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>… very weird</p>
<li>I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
<li><p>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</p></li>
<li><p>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</p>
<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'</code></pre>
<li><p>I confirmed the same on CGSpace itself after making one HEAD request</p></li>
<li><p>So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week</p>
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
<li><p>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded</p>
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind’s server as of 2018-04-01</li>
<li>Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this</li>
<li><p>UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check <code>iostat 1 10</code> and I saw that CPU steal is around 10–30 percent right now…</p></li>
<li><p>The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:</p>
<li><p>Start checking IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)</p>
<li>The matched values can be accessed with <code>cell.recon.match.name</code>, but some of the new values don’t appear, perhaps because I edited the original cell values?</li>
<li><p>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</p></li>
<li><p>I also noticed a handful of errors in our current list of affiliations so I corrected them:</p>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;</code></pre>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';</code></pre>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;</code></pre>
<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].</code></pre>
<li>Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li><p><code>45.5.186.2</code> is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:</p>
<li>Abenet pointed out a possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li><p>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF format</a></p></li>
<li><p>I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…</p></li>
<li><p>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</p>
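<p>A minimal sketch of what such a script might look like, not the code actually used here. It assumes habanero's <code>Crossref.funders(query=...)</code> call and the usual CrossRef response shape (<code>message.items</code>, each with <code>name</code> and <code>alt-names</code>); the helper names, the <code>mailto</code> address, and the canned response are hypothetical, and the matching is a plain case-insensitive comparison:</p>

```python
def crossref_funder_search(name, mailto="you@example.org"):
    """Query the CrossRef funder registry via habanero (needs network access)."""
    from habanero import Crossref  # third-party: pip install habanero

    return Crossref(mailto=mailto).funders(query=name, limit=5)


def match_funder(name, search=crossref_funder_search):
    """Return the first registry entry whose name or alt-name equals `name`.

    The comparison ignores case and surrounding whitespace; anything fuzzier
    still needs a human to check. Returns None when nothing matches.
    """
    wanted = name.strip().casefold()
    items = search(name).get("message", {}).get("items", [])
    for item in items:
        candidates = [item.get("name", "")] + list(item.get("alt-names", []))
        if any(c.strip().casefold() == wanted for c in candidates):
            return item
    return None


# Made-up response in the assumed CrossRef shape, so the matching logic can
# be tried offline. The id and names are placeholders, not registry data.
CANNED_RESPONSE = {
    "message": {
        "items": [
            {"id": "0000", "name": "Example Aid Agency", "alt-names": ["EAA"]},
        ]
    }
}


def fake_search(_name):
    """Stand-in for crossref_funder_search that never touches the network."""
    return CANNED_RESPONSE


print(match_funder("example aid agency", search=fake_search))
print(match_funder("Unknown Funder", search=fake_search))
```

<p>Passing the search function in as a parameter keeps the matching testable without hitting the API; calling <code>match_funder(name)</code> with the default would query CrossRef for real.</p>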
<li>Continue proofing IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
<li>Potential duplicates (same DOI, similar title, same authors):</li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/100580"><sup>10568</sup>⁄<sub>100580</sub></a> and <a href="https://dspacetest.cgiar.org/handle/10568/100579"><sup>10568</sup>⁄<sub>100579</sub></a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/100444"><sup>10568</sup>⁄<sub>100444</sub></a> and <a href="https://dspacetest.cgiar.org/handle/10568/100423"><sup>10568</sup>⁄<sub>100423</sub></a></li>
<li>Two DOIs with incorrect URL formatting</li>
<li>Two misspelled IITA subjects</li>
<li>Two authors with smart quotes</li>
<li>One misspelled “Peer review”</li>
<li>One invalid ISBN that I fixed by Googling the title</li>
<li>Lots of issues with sponsors (German Aid Agency, Swiss Aid Agency, Italian Aid Agency, Dutch Aid Agency, etc)</li>
<li>I validated all the AGROVOC subjects against our latest list with reconcile-csv</li>
<li>About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually</li>
<li><p>I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:</p>
<li>They can’t seem to understand the Altmetric + Twitter flow for associating Handles and DOIs</li>
<li>To make things worse, many of their items DON’T have DOIs, so when Altmetric harvests them of course there is no link! Then, a bunch of their items don’t have scores because they never tweeted them!</li>
<li>They added a DOI to this old item <a href="https://cgspace.cgiar.org/handle/10568/97087"><sup>10568</sup>⁄<sub>97087</sub></a> this morning and wonder why Altmetric’s score hasn’t linked with the DOI magically</li>
<li>We should check in a week to see if Altmetric makes the association when they harvest again</li>
<li>I copied the <code>statistics</code> and <code>statistics-2018</code> Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing.png" alt="Java GC during Solr indexing with CMS"/></p>
<ul>
<li>It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear “sawtooth” pattern in the garbage collection</li>
<li>I am curious if the GC pattern would be different if I switched from CMS (<code>-XX:+UseConcMarkSweepGC</code>) to G1GC</li>
<li>I switched to G1GC and restarted Tomcat but for some reason I couldn’t see the Tomcat PID in VisualVM…
<ul>
<li>Anyways, the indexing process took much longer, perhaps twice as long!</li>
</ul></li>
<li>I tried again with the GC tuning settings from the Solr 4.10.4 release:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-solr-settings.png" alt="Java GC during Solr indexing Solr 4.10.4 settings"/></p>
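<p>The per-IP Solr counts earlier in these notes all follow one pattern: a <code>q</code> selecting bitstream hits for a set of IPs, a few <code>fq</code> filters, and <code>rows=0</code> so that no documents are returned, only the count. A minimal sketch of building such a URL in Python, assuming the same local Solr port and field names as the <code>http</code> commands above (the helper name <code>solr_count_url</code> is hypothetical):</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_count_url(core_url, ips, year_month):
    """Build a statistics-core URL that counts ORIGINAL-bundle views for some IPs."""
    ip_clause = " OR ".join(f"ip:{ip}" for ip in ips)
    params = [
        ("q", f"type:0 AND ({ip_clause})"),  # type 0 = bitstream hits
        ("fq", "statistics_type:view"),
        ("fq", "bundleName:ORIGINAL"),
        ("fq", f"dateYearMonth:{year_month}"),
        ("rows", "0"),  # no documents, just numFound
        ("wt", "json"),
        ("indent", "true"),
    ]
    # urlencode percent-encodes the colons and spaces in the query values
    return f"{core_url}/select?{urlencode(params)}"


url = solr_count_url(
    "http://localhost:8081/solr/statistics",  # assumed: same host and port as above
    ["18.196.196.108", "18.195.78.144", "18.195.218.6"],
    "2019-03",
)
print(url)
# Round-trip the query string to confirm the three filters survived encoding
print(parse_qs(urlsplit(url).query)["fq"])
```

<p>Only the month changes between the two commands above, so calling the helper once per <code>dateYearMonth</code> value would give the month-by-month comparison.</p>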
<li>Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to <code>iostat 1 10</code>, though I haven’t had any Linode alerts in a few days</li>
<li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something</li>
<li>Reading an interesting <a href="https://teaspoon-consulting.com/articles/solr-cache-tuning.html">blog post about Solr caching</a></li>
<li>Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (<code>statistics</code> and <code>statistics-2018</code>) and monitored the memory usage of Tomcat in VisualVM</li>
<li>4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
<ul>
<li>Run 1:</li>
<li>Time: 3.11s user 0.44s system 0% cpu 13:45.07 total</li>
<li>Tomcat (not Solr) max JVM heap usage: 2.04 GiB</li>
<li>Run 2:</li>
<li>Time: 3.23s user 0.43s system 0% cpu 13:46.10 total</li>
<li>Tomcat (not Solr) max JVM heap usage: 2.06 GiB</li>
<li>Run 3:</li>
<li>Time: 3.23s user 0.42s system 0% cpu 13:14.70 total</li>
<li>Tomcat (not Solr) max JVM heap usage: 2.13 GiB</li>
<li>The biggest takeaway I have is that this workload benefits from a larger <code>filterCache</code> (for Solr fq parameter), but barely uses the <code>queryResultCache</code> (for Solr q parameter) at all
<ul>
<li>The number of hits goes up and the time taken decreases when we increase the <code>filterCache</code>, and total JVM heap memory doesn’t seem to increase much at all</li>
<li>I guess the <code>queryResultCache</code> size is always 2 because I’m only doing two queries: <code>type:0</code> and <code>type:2</code> (downloads and views, respectively)</li>
</ul></li>
<li>Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:</li>
<li>I ran one test with a <code>filterCache</code> of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it <em>drastically</em> increased the performance and memory usage of the dspace-statistics-api indexer</li>
<li>4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
<ul>
<li>Run 1:</li>
<li>Time: 2.85s user 0.42s system 2% cpu 2:28.92 total</li>
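<li><p>For scale, the wall-clock totals of Run 1 with each <code>filterCache</code> size can be compared directly. A small sketch that converts the shell's <code>time</code> totals to seconds and takes the ratio; it compares a single run of each configuration, so the ratio is only a rough indication (the function name is hypothetical):</p>

```python
def elapsed_seconds(total):
    """Convert a 'M:SS.ss' or 'H:MM:SS.ss' wall-clock total to seconds."""
    seconds = 0.0
    for part in total.split(":"):
        # Each colon-separated field is worth 60 times the one after it
        seconds = seconds * 60 + float(part)
    return seconds


small_cache = elapsed_seconds("13:45.07")  # Run 1, filterCache of 512
large_cache = elapsed_seconds("2:28.92")   # Run 1, filterCache of 16384
speedup = small_cache / large_cache

print(f"{small_cache:.2f}s vs {large_cache:.2f}s, ratio {speedup:.1f}")
```

</li>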
<li>I’ve been trying to copy the <code>statistics-2018</code> Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
<ul>
<li>I opened a support ticket to ask Linode to investigate</li>
<li>They asked me to send an <code>mtr</code> report from Fremont to Frankfurt and vice versa</li>
<li>Deploy Tomcat 7.0.94 on DSpace Test (linode19)
<ul>
<li>Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of something with how Ansible formatted the options string</li>
<li>I needed to use the “folded” YAML variable format <code>>-</code> (with the dash so it doesn’t add a return at the end)</li>
<li>UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with <code>iostat 1 10</code> and it’s in the 50s and 60s
<ul>
<li>The munin graph shows a lot of CPU steal (red) currently (and overall during the week):</li>
<li>Linode agreed to move CGSpace (linode18) to a new machine shortly after I filed my ticket about CPU steal two days ago and now the load is much more sane:</li>
<li>Abenet pointed out <a href="https://hdl.handle.net/10568/97912">an item</a> that doesn’t have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
<ul>
<li>I tweeted the Handle to see if it will pick it up…</li>
<li>Like clockwork, after fifteen minutes there was a donut showing on CGSpace</li>
<pre><code>2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address</code></pre>
<li><p>I will fix it in <code>dspace/config/modules/oai.cfg</code></p></li>
<li><p>Linode says that it is likely that the host CGSpace (linode18) is on is showing signs of hardware failure and they recommended that I migrate the VM to a new host</p>
<li>One blog post says that there is <a href="https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/">no overprovisioning in Azure</a>:</li>
</ul>
<blockquote cite="https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/">In Azure, with one exception being the A0, there is no overprovisioning… Each physical cores is only supporting one virtual core.</blockquote>
<ul>
<li>Perhaps that’s why the <a href="https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/">Azure pricing</a> is so expensive!</li>
<li>Add a privacy page to CGSpace
<ul>
<li>The work was mostly similar to the About page at <code>/page/about</code>, but in addition to adding i18n strings etc, I had to add the logic for the trail to <code>dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl</code></li>
<li>I copied the 18GB <code>statistics-2018</code> Solr core from Frankfurt to a Linode in London at 15MiB/sec, then from the London one to DSpace Test in Fremont at 15MiB/sec… so WTF is up with Frankfurt→Fremont?!</li>
<li>Finally upload the 218 IITA items from March to CGSpace
<ul>
<li>Abenet and I had to do a little bit more work to correct the metadata of one item that appeared to be a duplicate, but really just had the wrong DOI</li>
<li><p>While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (<code>dc.identifier.uri</code>)</p>
<li>I told him we never finished it, and that he should try to use the <code>/items/find-by-metadata-field</code> endpoint, with the caveat that you need to match the language attribute exactly (ie “en”, “en_US”, null, etc)</li>
<li>I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)</li>
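<li><p>To illustrate the language caveat, here is a minimal sketch that prepares one request per language variant for the same key and value. It assumes the endpoint accepts a POSTed JSON body with <code>key</code>, <code>value</code>, and <code>language</code> fields (the same three that appear in the DSpace log), and that the REST API lives under <code>/rest</code> on DSpace Test; the helper name and the base URL are assumptions, and the requests are only built, not sent:</p>

```python
import json
from urllib.request import Request


def find_by_metadata_request(rest_base, key, value, language):
    """Prepare (but do not send) a find-by-metadata-field POST request."""
    payload = {"key": key, "value": value, "language": language}
    return Request(
        f"{rest_base}/items/find-by-metadata-field",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


REST_BASE = "https://dspacetest.cgiar.org/rest"  # assumed base path

# The language must match exactly, so each variant needs its own request;
# None is serialized as JSON null.
LANGUAGES = ("en_US", "en", "", None)
prepared = [
    find_by_metadata_request(REST_BASE, "cg.subject.cpwf", "WATER MANAGEMENT", lang)
    for lang in LANGUAGES
]

for req in prepared:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

<p>Each prepared request could then be passed to <code>urllib.request.urlopen</code>; normalizing the language attributes in the database would reduce this to a single request per term.</p></li>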
<li><p>Note that curl only shows the HTTP 401 error if you use <code>-f</code> (fail), and only then if you <em>don’t</em> include <code>-s</code></p>
<pre><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;</code></pre>
<li><p>I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:</p>
<pre><code>2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!</code></pre>
<li><p>This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):</p>
<li><p>I tried to delete the item in the web interface, and it seems successful, but I can still access the item in the admin interface, and nothing changes in PostgreSQL</p></li>
<li><p>Meet with CodeObia to see progress on AReS version 2</p></li>
<li><p>Marissa Van Epp asked me to add a few new metadata values to their Phase II Project Tags field (cg.identifier.ccafsprojectpii)</p>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;</code></pre>
<li><p>Still trying to figure out the issue with the items that cause the REST API’s <code>/items/find-by-metadata-field</code> endpoint to throw an exception</p>
<li><p>I even tried to “expunge” the item using an <a href="https://wiki.duraspace.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems">action in CSV</a>, and it said “EXPUNGED!” but the item is still there…</p></li>
<li><p>In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:</p>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE 360295
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 1074345
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');</code></pre>
<li><p>Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos</p></li>
<li><p>In other news, while I was looking through the CSV in OpenRefine I saw lots of weird values in some fields… we should check, for example:</p>