mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-10-31 20:33:00 +01:00
+++
date = "2016-11-01T09:21:00+03:00"
author = "Alan Orth"
title = "November, 2016"
tags = ["Notes"]
+++

## 2016-11-01

- Add `dc.type` to the output options for Atmire's Listings and Reports module ([#286](https://github.com/ilri/DSpace/pull/286))

![Listings and Reports with output type](2016/11/listings-and-reports.png)

## 2016-11-02

- Migrate DSpace Test to DSpace 5.5 ([notes](https://gist.github.com/alanorth/61013895c6efe7095d7f81000953d1cf))
- Run all updates on DSpace Test and reboot the server
- Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! ([#63](https://github.com/ilri/DSpace/issues/63))
- Indexing Discovery on DSpace Test took 332 minutes, which is about five times as long as it usually takes
- At the end it appeared to finish correctly, but there were lots of errors right after it finished:

```
2016-11-02 15:09:48,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
2016-11-02 15:09:48,584 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
2016-11-02 15:09:48,589 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index
2016-11-02 15:09:48,590 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/51693 to Index
2016-11-02 15:09:48,590 INFO org.dspace.discovery.IndexClient @ Done with indexing
2016-11-02 15:09:48,600 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76456 to Index
2016-11-02 15:09:48,613 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/55536 to Index
2016-11-02 15:09:48,616 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index
2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @
java.lang.NullPointerException
        at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
        at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
        at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)
```

- DSpace is still up, and a few minutes later I see the default DSpace indexer is still running
- Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:

```
2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476
```

- I will raise a ticket with Atmire to ask them about it

## 2016-11-06

- After re-deploying and re-indexing I didn't see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take

## 2016-11-07

- A horrible one-liner to get the Linode ID from certain Ansible host vars:

```
$ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
```

- I noticed some weird CRPs in the database, and they don't show up in Discovery for some reason, perhaps because of the `:`
- I'll export these and fix them in batch:

```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
COPY 22
```

- Test running the replacements:

```
$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
```

- Add `AMR` to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary ([#288](https://github.com/ilri/DSpace/pull/288))

## 2016-11-08

- Atmire's Listings and Reports module seems to be broken on DSpace 5.5

![Listings and Reports broken in DSpace 5.5](2016/11/listings-and-reports-55.png)

- I've filed a ticket with Atmire
- Thinking about batch updates for ORCIDs and authors
- Playing with [SolrClient](https://github.com/moonlitesolutions/SolrClient) in Python to query Solr
- All records in the authority core are either `authority_type:orcid` or `authority_type:person`
- There is a `deleted` field and all items seem to be `false`, but it might be an important sanity check to remember
- The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL
- Dump of the top ~200 authors in CGSpace:

```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;
```
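
A batch update along those lines could be sketched in Python. Everything here is illustrative: the CSV column names, the sample authority UUID, and the `confidence=600` value are my assumptions, not the actual workflow.

```python
import csv
import io

# Hypothetical two-column CSV: the author name as stored in metadatavalue,
# plus the Solr authority ID to assign (the UUID is made up for illustration)
SAMPLE = io.StringIO(
    "text_value,authority\n"
    '"Orth, Alan",0e642152-5a0d-4b3b-8b38-278d0b36100b\n'
)

# Build (authority, text_value) tuples for a parameterized UPDATE;
# metadata_field_id=3 is the author field, as in the \copy dump above
params = [(row["authority"], row["text_value"]) for row in csv.DictReader(SAMPLE)]

# With psycopg2 the batch could then be applied in one go (sketch only;
# confidence=600 meaning "accepted" is a DSpace convention):
# cur.executemany(
#     "update metadatavalue set authority=%s, confidence=600"
#     " where metadata_field_id=3 and text_value=%s",
#     params,
# )
print(params)
```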

## 2016-11-09

- CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the `5_x-prod` branch, and rebooted the server
- The error was `Timeout waiting for idle object` but I haven't looked into the Tomcat logs to see what happened
- Also, I ran the corrections for CRPs from earlier this week

## 2016-11-10

- Helping Megan Zandstra and CIAT with some questions about the REST API
- Playing with `find-by-metadata-field`, this works:

```
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
```

- But the results are deceiving, because metadata fields can have text languages and your query must match exactly!

```
dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
 text_value | text_lang
------------+-----------
 SEEDS      |
 SEEDS      |
 SEEDS      | en_US
(3 rows)
```

- So basically, the text language here could be null, blank, or en_US
- To query metadata with these properties, you can do:

```
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
55
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
34
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
```
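
Rather than hand-editing the curl payload for each `text_lang` variant, the three request bodies can be generated in a loop; a small Python sketch (omitting `language` versus sending an empty string are distinct queries, hence the `None` sentinel — the payload shape is just the one used in the curl commands above):

```python
import json

def payloads(key, value, languages=(None, "", "en_US")):
    """Yield one find-by-metadata-field JSON body per text_lang variant.

    None means the "language" property is omitted entirely from the body.
    """
    for lang in languages:
        body = {"key": key, "value": value}
        if lang is not None:
            body["language"] = lang
        yield json.dumps(body)

for p in payloads("cg.subject.ilri", "SEEDS"):
    print(p)
```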

- The results (55+34=89) don't seem to match those from the database:

```
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
 count
-------
    15
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
 count
-------
     4
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
 count
-------
    66
```

- So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85...
- And the `find-by-metadata-field` endpoint doesn't seem to have a way to get all items with the field, or a wildcard value
- I'll ask a question on the dspace-tech mailing list
- And speaking of `text_lang`, this is interesting:

```
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang
-----------

 ethnob
 en
 spa
 EN
 es
 frn
 en_
 en_US

 EN_US
 eng
 en_U
 fr
(14 rows)
```

- Generate a list of all these so I can maybe fix them in batch:

```
dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
COPY 14
```
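
Fixing these in batch would need a mapping from the observed values to canonical codes. A Python sketch, where the target codes are my guesses (CGSpace seems to standardize on `en_US`, and values like `ethnob` or `frn` look more like data-entry errors than language codes):

```python
# Mapping from the text_lang values observed above to canonical codes;
# the targets are guesses, and unknown values pass through unchanged
LANG_FIXES = {
    "en": "en_US",
    "EN": "en_US",
    "en_": "en_US",
    "en_U": "en_US",
    "EN_US": "en_US",
    "eng": "en_US",
    "spa": "es",
    "frn": "fr",
}

def normalize(text_lang):
    """Return a canonical code, or None for null/blank values."""
    if text_lang is None or text_lang.strip() == "":
        return None
    return LANG_FIXES.get(text_lang, text_lang)

print(normalize("EN_US"))  # en_US
```

Each mapping could then be applied with an UPDATE like the `text_lang='en_US'` example below.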

- Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:

```
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
UPDATE 85
```

- The `fix-metadata.py` script I have is meant for specific metadata values, so if I want to update some `text_lang` values I should just do it directly in the database
- For example, on a limited set:

```
dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
UPDATE 420
```

- And assuming I want to do it for all fields:

```
dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
UPDATE 183726
```

- After that I restarted Tomcat and PostgreSQL (because I'm superstitious about caches) and now I see the following in a REST API query:

```
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
71
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
0
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
```

- Not sure what's going on, but Discovery shows 83 values, and database shows 85, so I'm going to reindex Discovery just in case

## 2016-11-14

- I applied Atmire's suggestions to fix Listings and Reports for DSpace 5.5 and now it works
- There were some issues with `dspace/modules/jspui/pom.xml`, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire's installation procedure must have changed
- So there is apparently a native Tomcat way to limit web crawlers to one session: [Crawler Session Manager](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve)
- After adding that to `server.xml`, bots matching the pattern in the configuration will all use ONE session, just like normal users:

```
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 14 Nov 2016 19:47:29 GMT
Server: nginx
Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Robots-Tag: none

$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 14 Nov 2016 19:47:35 GMT
Server: nginx
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```

- The first one gets a session, and any requests after that within 60 seconds will be internally mapped to the same session by Tomcat
- This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!

## 2016-11-15

- The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:

![Tomcat JVM heap (day) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-day.png)

![Tomcat JVM heap (week) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-week.png)

- Seems the default regex doesn't catch Baidu, though:

```
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0

$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```

- Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:

```
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
```
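
Before deploying, the pattern can be sanity-checked against the user agents seen in the logs. A quick Python sketch (Tomcat's valve requires the whole user-agent string to match, which `re.fullmatch` approximates here):

```python
import re

# Patterns from the valve configuration above: DEFAULT is Tomcat 7's stock
# crawlerUserAgents value, and EXTENDED adds the Baiduspider alternative
DEFAULT = r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
EXTENDED = DEFAULT + r"|.*Baiduspider.*"

# The bot user agents observed in the nginx access logs
AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]

def matches(pattern, user_agent):
    # the entire string must match, like java.util.regex Matcher.matches()
    return re.fullmatch(pattern, user_agent) is not None

print(all(matches(EXTENDED, ua) for ua in AGENTS))        # True
print([ua for ua in AGENTS if not matches(DEFAULT, ua)])  # only Baiduspider
```

Note that YandexImages is caught only because its URL happens to contain "bots"; Baiduspider has no "bot" substring anywhere, which is why it needs its own alternative.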

- Looking at the bots that were active yesterday it seems the above regex should be sufficient:

```
$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
```