<li>A few users noticed that CGSpace wasn’t loading items today, item pages seem blank
<ul>
<li>I looked at the PostgreSQL locks but they don’t seem unusual</li>
<li>I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved</li>
<li>I restarted Tomcat and PostgreSQL and the issue was gone</li>
</ul>
</li>
<li>Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the <code>5_x-prod</code> branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request</li>
</ul>
<ul>
<li>Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li>
<li>First looking at the traffic in the morning:</li>
<li>64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
</code></pre><ul>
<li>I will purge hits from that IP from Solr</li>
<li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:</li>
<li>They used to be “TurnitinBot”… hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li>
<li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that’s impressive! I don’t need to add them to the “bad bot” rate limit list in nginx</li>
<li>While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a small number of requests with this user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
</code></pre>
<li>It’s only a few hundred requests each, but I am very suspicious so I will record it here and purge their IPs from Solr</li>
<li>Then I see 185.187.30.14 and 185.187.30.13 making requests also, with several different “normal” user agents
<ul>
<li>They are both apparently in France, belonging to Scalair FR hosting</li>
<li>I will purge their requests from Solr too</li>
</ul>
</li>
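<li>A minimal Python sketch of building such a purge request, assuming the statistics core stores the client address in a field named <code>ip</code> and accepts JSON delete-by-query bodies on its update handler. The helper name <code>build_purge_body</code> is hypothetical, and the sketch only builds the body without sending anything to Solr:
<pre><code>import json


def build_purge_body(ips):
    """Build a Solr JSON delete-by-query body matching any of the given IPs."""
    # Quote each address so Solr treats it as a literal value
    clauses = " OR ".join(f'ip:"{ip}"' for ip in ips)
    return json.dumps({"delete": {"query": clauses}})


# The two Scalair addresses mentioned above
body = build_purge_body(["185.187.30.13", "185.187.30.14"])
print(body)
</code></pre>
The body would still need to be posted to the statistics core’s update handler and committed before the hits actually disappear.</li>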
<li>Now I see some other new bots I hadn’t noticed before:
<ul>
<li><code>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com</code></li>
<li><code>Consilio (WebHare Platform 4.28.2-dev); LinkChecker)</code>, which appears to be a <a href="https://www.utwente.nl/en/websites/webhare/">university CMS</a></li>
<li>I will add <code>LinkCheck</code>, <code>Consilio</code>, and <code>WebHare</code> to the list of DSpace bot agents and purge them from Solr stats</li>
<li>COUNTER-Robots list already has <code>link.?check</code> but for some reason DSpace didn’t match that and I see hits for some of these…</li>
<li>Maybe I should add <code>[Ll]ink.?[Cc]heck.?</code> to a custom list for now?</li>
<li>For now I added <code>Turnitin</code> to the <a href="https://github.com/atmire/COUNTER-Robots/pull/34">new bots pull request on COUNTER-Robots</a></li>
</ul>
</li>
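<li>One possible explanation for the missed matches, sketched with Python’s <code>re</code> module rather than DSpace’s own matcher: if the COUNTER-Robots patterns are applied case-sensitively (an unverified assumption about DSpace), the lowercase <code>link.?check</code> cannot match <code>LinkCheck</code>, while the custom pattern or a case-insensitive match can:
<pre><code>import re

# The two user agents noted above
agents = [
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com",
    "Consilio (WebHare Platform 4.28.2-dev); LinkChecker)",
]

upstream = re.compile(r"link.?check")  # as listed in COUNTER-Robots
custom = re.compile(r"[Ll]ink.?[Cc]heck.?")  # proposed custom pattern
insensitive = re.compile(r"link.?check", re.IGNORECASE)

# One (upstream, custom, insensitive) tuple of booleans per user agent
results = [
    (
        bool(upstream.search(agent)),
        bool(custom.search(agent)),
        bool(insensitive.search(agent)),
    )
    for agent in agents
]
print(results)
</code></pre>
In Python both agents are missed by the upstream pattern as written and caught by the other two, which would be consistent with a case-sensitivity problem, but it does not prove that this is what DSpace is doing.</li>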
<li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li>
<li>I will revert the default “example” agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven’t merged yet:</li>
<pre><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
</code></pre>
<li>I had told Atmire about this several weeks ago… but I reminded them again in the ticket
<ul>
<li>Atmire says they are able to build fine, so I tried again and noticed that I had been building with <code>-Denv=dspacetest.cgiar.org</code>, which is not necessary for DSpace 6 of course</li>
<li>Once I removed that it builds fine</li>
</ul>
</li>
<li>I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!</li>
<li>I need to export some Solr statistics data from CGSpace to test Salem’s modifications to the dspace-statistics-api
<ul>
<li>He modified it to query Solr on the fly instead of indexing it, which will be heavier and slower, but allows us to get more granular stats and countries/cities</li>
<li>Because we have so many records I want to use solr-import-export-json to get several months at a time with a date range, but it seems there are issues with curl first (need to disable globbing with <code>-g</code> and URL encode the range)</li>
<li>For reference, the <a href="https://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/schema/DateField.html">Solr 4.10.x DateField docs</a></li>
<li>This range works in Solr UI: <code>[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]</code></li>
<li>But not in solr-import-export-json… hmmm… seems we need to URL encode <em>only</em> the date range itself, but not the brackets:</li>
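<li>A minimal Python sketch of that encoding (an illustration, not the original command), assuming the filter is on a date field named <code>time</code>. The helper name <code>encode_solr_range</code> is hypothetical. It percent-encodes only the contents of the range and leaves the square brackets literal:
<pre><code>from urllib.parse import quote


def encode_solr_range(field, start, end):
    """Return a field filter whose date range is URL encoded inside literal brackets."""
    # safe="" makes quote() encode the colons as well as the spaces
    inner = quote(f"{start} TO {end}", safe="")
    return f"{field}:[{inner}]"


encoded = encode_solr_range("time", "2019-01-01T00:00:00Z", "2019-06-30T23:59:59Z")
print(encoded)
</code></pre>
This prints <code>time:[2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]</code>, with the colons and spaces encoded and the brackets untouched.</li>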
<li>Import twelve items into the <a href="https://hdl.handle.net/10568/97076">CRP Livestock multimedia</a> collection for Peter Ballantyne
<ul>
<li>I ran the data through csv-metadata-quality first to validate and fix some common mistakes</li>
<li>Interesting to check the data with <code>csvstat</code> to see if there are any duplicates</li>
</ul>
</li>
<li>Peter recently asked me to add Target audience (<code>cg.targetaudience</code>) to the CGSpace sidebar facets and AReS filters
<ul>
<li>I added it on my local DSpace test instance, but I’m waiting for him to tell me what he wants the header to be: “Audiences”, “Target audience”, etc…</li>
</ul>
</li>
<li>Peter also asked me to increase the size of links in the CGSpace “Welcome” text
<ul>
<li>I suggested using the CSS <code>font-size: larger</code> property to just bump it up one relative to what it already is</li>
<li>He said it looks good, but that the links now seem OK anyway (I told him to refresh, as I had made them bold a few days ago), so we don’t need to adjust them</li>
</ul>
</li>
<li>Mohammed Salem modified my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to query Solr directly so I started writing a script to benchmark it today
<ul>
<li>I will monitor the JVM memory and CPU usage in visualvm, just like I did in 2019-04</li>
<li>I noticed an issue with his limit parameter so I sent him some feedback on that in the meantime</li>
<li>I noticed that we have 20,000 distinct values for <code>dc.subject</code>, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:</li>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:</li>
</ul>
<pre><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>Note the use of the POSIX character class :)</li>
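<li>A rough Python analogue of that filter, for illustration only: PostgreSQL’s <code>[[:lower:]]</code> follows the database locale while Python’s <code>str.islower()</code> follows Unicode, so edge cases may differ. The function name and the sample subjects are made up:
<pre><code>def needs_uppercase(value):
    """Approximate text_value ~ '[[:lower:]]': true if any character is lowercase."""
    return any(ch.islower() for ch in value)


# Made-up sample values, not taken from the database
subjects = ["LIVESTOCK", "Climate change", "food security", "GENDER"]

# Only values containing a lowercase character are rewritten
fixed = [s.upper() if needs_uppercase(s) else s for s in subjects]
print(fixed)
</code></pre>
Values that are already fully uppercase are left alone, which is the point of filtering on the character class instead of updating every row.</li>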
<li>I suggest that we generate a list of the top 5,000 values that don’t match AGROVOC so that Sisay can correct them
<ul>
<li>Start by getting the top 6,500 subjects (assuming that the top ~1,500 are valid from our previous work):</li>
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-cgspace-subjects.txt
</code></pre><ul>
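<li>A rough Python equivalent of the <code>csvcut</code> and <code>head</code> step, for illustration only. The function name <code>top_values</code> and the sample rows are made up, and the input is assumed to be the value,count CSV already sorted by count:
<pre><code>import csv
import io
from itertools import islice


def top_values(handle, limit):
    """Return the first CSV column from the first `limit` rows."""
    reader = csv.reader(handle)
    return [row[0] for row in islice(reader, limit)]


# Made-up rows in the same value,count layout as the export
sample = io.StringIO('LIVESTOCK,120\n"MAIZE, WHITE",45\nGENDER,30\n')
top_two = top_values(sample, 2)
print(top_two)
</code></pre>
Using a CSV parser rather than splitting on commas keeps subjects that contain a comma intact.</li>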
<li>Then start looking them up using <code>agrovoc-lookup.py</code>:</li>
<li>I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script
<ul>
<li>Mostly to make more efficient usage of the requests cache and to use parameterized requests instead of building the request URL by concatenating the URL with query parameters</li>
</ul>
</li>
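<li>A small standard-library sketch of why parameterized requests are safer than concatenation. The endpoint URL and parameter names here are placeholders, not the real AGROVOC API, and <code>build_search_url</code> is a hypothetical helper:
<pre><code>from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, subject, language):
    """Build a search URL with properly escaped query parameters."""
    params = {"query": subject, "lang": language}
    # urlencode() escapes spaces and reserved characters in each value
    return base_url + "?" + urlencode(params)


url = build_search_url("https://example.org/rest/v1/search", "SOIL FERTILITY", "en")
print(url)

# Parsing the URL again recovers the original values unchanged
print(parse_qs(urlsplit(url).query))
</code></pre>
With the <code>requests</code> library, passing the same dictionary via the <code>params</code> argument of <code>requests.get()</code> performs this encoding automatically.</li>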
<li>I modified the <code>agrovoc-lookup.py</code> script to save its results as a CSV, with the subject, language, type of match (preferred or alternate), and total number of matches, rather than saving two separate files
<ul>
<li>Note that I see <code>prefLabel</code>, <code>matchedPrefLabel</code>, and <code>altLabel</code> in the REST API responses and I’m not sure what the second one means</li>
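<li>A minimal sketch of what such a single results file could look like, using Python’s <code>csv</code> module. The column names, the helper <code>write_results</code>, and the sample rows are all assumptions for illustration, not the script’s actual output:
<pre><code>import csv
import io

# Hypothetical column names for the combined results file
FIELDS = ["subject", "language", "match type", "number of matches"]


def write_results(rows, handle):
    """Write one row per looked-up subject, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up lookup results: one match and one miss
rows = [
    {"subject": "LIVESTOCK", "language": "en", "match type": "prefLabel", "number of matches": 1},
    {"subject": "NOT A REAL TERM", "language": "en", "match type": "", "number of matches": 0},
]

buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
</code></pre>
Keeping matched and unmatched subjects in one file means the rejects can be filtered later on the match count instead of being tracked in a second file.</li>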