- I was trying to add Felix Shaw's account back to the Administrators group on DSpace Test, but I couldn't find his name in the user search of the groups page
- If I searched for "Felix" or "Shaw" I saw other matches, included one for his personal email address!
- I ended up finding him via searching for his email address
- The other thing visible there is that the past few days the load has spiked to 500% and I don't think it's a coincidence that the Solr updating thing is happening...
- I ran all system updates and rebooted the server
- Udana asked why item [10568/91278](https://cgspace.cgiar.org/handle/10568/91278) didn't have an Altmetric badge on CGSpace, but on the [WLE website](https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity) it does
- I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
- I tweeted the item and I assume this will link the Handle with the DOI in the system
- Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
-`45.5.184.72` is in Colombia so it's probably CIAT, and I see they are indeed trying to get crawl the Discover pages on CIAT's datasets collection:
```
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
```
- Their user agent is the one I added to the badbots list in nginx last week: "GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1"
- They made 22,000 requests to Discover on this collection today alone (and it's only 11AM):
- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
- I am trying to see if these are registered as downloads in Solr or not
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:
```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
{
"response": {
"docs": [],
"numFound": 96925,
"start": 0
},
"responseHeader": {
"QTime": 1,
"params": {
"fq": [
"statistics_type:view",
"bundleName:ORIGINAL",
"dateYearMonth:2019-03"
],
"indent": "true",
"q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
"rows": "0",
"wt": "json"
},
"status": 0
}
}
```
- Strangely I don't see many hits in 2019-04:
```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
{
"response": {
"docs": [],
"numFound": 38,
"start": 0
},
"responseHeader": {
"QTime": 1,
"params": {
"fq": [
"statistics_type:view",
"bundleName:ORIGINAL",
"dateYearMonth:2019-04"
],
"indent": "true",
"q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
"rows": "0",
"wt": "json"
},
"status": 0
}
}
```
- Making some tests on GET vs HEAD requests on the [CTA Spore 192 item](https://dspacetest.cgiar.org/handle/10568/100289) on DSpace Test:
```
$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
```
2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
```
- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log:
```
2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
```
- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird
- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
- I will try to re-deploy the `5_x-dev` branch and test again
- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`)
- I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal):
```
$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
{
"response": {
"docs": [],
"numFound": 909,
"start": 0
},
"responseHeader": {
"QTime": 0,
"params": {
"fq": [
"statistics_type:view",
"isInternal:true"
],
"indent": "true",
"q": "type:0 AND time:2019-04-07*",
"rows": "0",
"wt": "json"
},
"status": 0
}
}
```
- I confirmed the same on CGSpace itself after making one HEAD request
- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
- Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
- This leads me to believe there is something specifically wrong with DSpace Test (linode19)
- The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
- Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat`
- For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01
- Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this
- UptimeRobot said CGSpace went down and up a few times tonight, and my first instict was to check `iostat 1 10` and I saw that CPU steal is around 10–30 percent right now...
- The load average is super high right now, as I've noticed the last few times UptimeRobot said that CGSpace went down:
```
$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
```
- According to the server logs there is actually not much going on right now:
- Start checking IITA's last round of batch uploads from [March on DSpace Test](https://dspacetest.cgiar.org/handle/10568/100333) (20193rd.xls)
- Lots of problems with affiliations, I had to correct about sixty of them
- I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:
```
$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
```
- After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
- The matched values can be accessed with `cell.recon.match.name`, but some of the new values don't appear, perhaps because I edited the original cell values?
- I ended up using this GREL expression to copy all values to a new column:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
```
- Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):
```
dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
```
- I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:
```
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
```
- I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
- Linode Support still didn't respond to my ticket from yesterday, so I attached a new output of `iostat 1 10` and asked them to move the VM to a less busy host
- Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace
- I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items
- In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night
## 2019-04-10
- Abenet pointed out a possibility of validating funders against the [CrossRef API](https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API)
- Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked
- Otherwise, they provide the funder data in [CSV and RDF format](https://www.crossref.org/services/funder-registry/)
- I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn't match will need a human to go and do some manual checking and informed decision making...
- I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA's records, so I applied them to DSpace Test and CGSpace:
- Answer more questions about DOIs and Altmetric scores from WLE
- Answer more questions about DOIs and Altmetric scores from IWMI
- They can't seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
- To make things worse, many of their items DON'T have DOIs, so when Altmetric harvests them of course there is no link! - Then, a bunch of their items don't have scores because they never tweeted them!
- They added a DOI to this old item [10567/97087](https://cgspace.cgiar.org/handle/10568/97087) this morning and wonder why Altmetric's score hasn't linked with the DOI magically
- We should check in a week to see if Altmetric will make the association after one week when they harvest again
- I copied the `statistics` and `statistics-2018` Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api):
![Java GC during Solr indexing with CMS](/cgspace-notes/2019/04/visualvm-solr-indexing.png)
- It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear "sawtooth" pattern in the garbage collection
- I am curious if the GC pattern would be different if I switched from the `-XX:+UseConcMarkSweepGC` to G1GC
- I switched to G1GC and restarted Tomcat but for some reason I couldn't see the Tomcat PID in VisualVM...
- Anyways, the indexing process took much longer, perhaps twice as long!
- I tried again with the GC tuning settings from the Solr 4.10.4 release:
![Java GC during Solr indexing Solr 4.10.4 settings](/cgspace-notes/2019/04/visualvm-solr-indexing-solr-settings.png)
- Reading an interesting [blog post about Solr caching](https://teaspoon-consulting.com/articles/solr-cache-tuning.html)
- Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (`statistics` and `statistics-2018`) and monitored the memory usage of Tomcat in VisualVM
- 4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
- Time: 3.11s user 0.44s system 0% cpu 13:45.07 total
- Tomcat (not Solr) max JVM heap usage: 2.04 GiB
- Run 2:
- Time: 3.23s user 0.43s system 0% cpu 13:46.10 total
- Tomcat (not Solr) max JVM heap usage: 2.06 GiB
- Run 3:
- Time: 3.23s user 0.42s system 0% cpu 13:14.70 total
- The biggest takeaway I have is that this workload benefits from a larger `filterCache` (for Solr fq parameter), but barely uses the `queryResultCache` (for Solr q parameter) at all
- The number of hits goes up and the time taken decreases when we increase the `filterCache`, and total JVM heap memory doesn't seem to increase much at all
- I guess the `queryResultCache` size is always 2 because I'm only doing two queries: `type:0` and `type:2` (downloads and views, respectively)
- Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:
- I ran one test with a `filterCache` of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it *drastically* increased the performance and memory usage of the dspace-statistics-api indexer
- 4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
- Time: 2.85s user 0.42s system 2% cpu 2:28.92 total
- Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of something with how Ansible formatted the options string
- I needed to use the "folded" YAML variable format `>-` (with the dash so it doesn't add a return at the end)
- Linode agreed to move CGSpace (linode18) to a new machine shortly after I filed my ticket about CPU steal two days ago and now the load is much more sane:
- Abenet pointed out [an item](https://hdl.handle.net/10568/97912) that doesn't have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
- I tweeted the Handle to see if it will pick it up...
- Like clockwork, after fifteen minutes there was a donut showing on CGSpace
- I want to get rid of this annoying warning that is constantly in our DSpace logs:
```
2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
```
- Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):
```
$ grep -c 'Falling back to request address' dspace.log.2019-04-20
dspace.log.2019-04-20:1515
```
- I will fix it in `dspace/config/modules/oai.cfg`
- Linode says that it is likely that the host CGSpace (linode18) is on is showing signs of hardware failure and they recommended that I migrate the VM to a new host
- I told them to migrate it at 04:00:00AM Frankfurt time, when nobody in East Africa, Europe, or South America should be using the server
- One blog post says that there is [no overprovisioning in Azure](https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/):
<blockquotecite="https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/">In Azure, with one exception being the A0, there is no overprovisioning… Each physical cores is only supporting one virtual core.</blockquote>
- Perhaps that's why the [Azure pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) is so expensive!
- Add a privacy page to CGSpace
- The work was mostly similar to the About page at `/page/about`, but in addition to adding i18n strings etc, I had to add the logic for the trail to `dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl`
- I copied the 18GB `statistics-2018` Solr core from Frankfurt to a Linode in London at 15MiB/sec, then from the London one to DSpace Test in Fremont at 15MiB/sec... so WTF us up with Frankfurt→Fremont?!
- Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
- I told him we never finished it, and that he should try to use the `/items/find-by-metadata-field` endpoint, with the caveat that you need to match the language attribute exactly (ie "en", "en_US", null, etc)
- I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)
- Note that curl only shows the HTTP 401 error if you use `-f` (fail), and only then if you *don't* include `-s`
- I see there are about 1,000 items using CPWF subject "WATER MANAGEMENT" in the database, so there should definitely be results
- The breakdown of `text_lang` fields used in those items is 942:
```
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
count
-------
376
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
count
-------
149
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
count
-------
417
(1 row)
```
- I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn't have permission to access... from the DSpace log:
```
2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!
```
- Nevertheless, if I request using the `null` language I get 1020 results, plus 179 for a blank language attribute:
- I tried to delete the item in the web interface, and it seems successful, but I can still access the item in the admin interface, and nothing changes in PostgreSQL
- Export a list of authors for Peter to look through:
```
dspacetest=# # \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
74648 | 113 | f | f | 2019-04-28 08:48:52.114-07 | | f
(1 row)
```
- And I tried the `curl` command from above again, but I still get the HTTP 401 and and the same error in the DSpace log:
```
2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
```
- I even tried to "expunge" the item using an [action in CSV](https://wiki.duraspace.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems), and it said "EXPUNGED!" but the item is still there...
- Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I'll try to do a CSV
- In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:
```
dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
text_lang | count
-----------+---------
| 358647
* | 11
E. | 1
en | 1635
en_US | 602312
es | 12
es_ES | 2
ethnob | 1
fr | 2
spa | 2
| 1074345
(11 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE 360295
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 1074345
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
UPDATE 14
```
- Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos
- In other news, while I was looking through the CSV in OpenRefine I saw lots of weird values in some fields... we should check, for example: