---
title: "July, 2020"
date: 2020-07-01T10:53:54+03:00
author: "Alan Orth"
categories:
---
## 2020-07-01
- A few users noticed that CGSpace wasn't loading items today, item pages seem blank
- I looked at the PostgreSQL locks but they don't seem unusual
- I guess this is the same "blank item page" issue that we had a few times in 2019 that we never solved
- I restarted Tomcat and PostgreSQL and the issue was gone
- Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the `5_x-prod` branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter's request
- Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning
- First looking at the traffic in the morning:
```console
# cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
2986 10.38% 1 0.08% 17.39 MiB 199.47.87.144
2286  7.94% 1 0.08%  13.04 MiB 199.47.87.142
```
- 64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:
```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
```
- I will purge hits from that IP from Solr
- The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:
```console
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
numFound="41694"
```
- They used to be "TurnitinBot"... hhmmmm, seems they use both: https://turnitin.com/robot/crawlerinfo.html
- I will add Turnitin to the DSpace bot user agent list, but I see they are requesting `robots.txt` and only requesting item pages, so that's impressive! I don't need to add them to the "bad bot" rate limit list in nginx
- While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a small number of requests each with this user agent:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
```
- The IPs all belong to HostRoyale:
```console
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
185.152.250.101
185.152.250.103
185.152.250.105
185.152.250.107
185.152.250.111
185.152.250.115
185.152.250.119
185.152.250.121
185.152.250.123
185.152.250.125
185.152.250.129
185.152.250.13
185.152.250.131
185.152.250.133
185.152.250.135
185.152.250.137
185.152.250.141
185.152.250.145
185.152.250.149
185.152.250.153
185.152.250.155
185.152.250.157
185.152.250.159
185.152.250.161
185.152.250.163
185.152.250.165
185.152.250.167
185.152.250.17
185.152.250.171
185.152.250.183
185.152.250.189
185.152.250.191
185.152.250.197
185.152.250.201
185.152.250.205
185.152.250.209
185.152.250.21
185.152.250.213
185.152.250.217
185.152.250.219
185.152.250.221
185.152.250.223
185.152.250.225
185.152.250.227
185.152.250.229
185.152.250.231
185.152.250.233
185.152.250.235
185.152.250.239
185.152.250.243
185.152.250.247
185.152.250.249
185.152.250.25
185.152.250.251
185.152.250.253
185.152.250.255
185.152.250.27
185.152.250.29
185.152.250.3
185.152.250.31
185.152.250.39
185.152.250.41
185.152.250.47
185.152.250.5
185.152.250.59
185.152.250.63
185.152.250.65
185.152.250.67
185.152.250.7
185.152.250.71
185.152.250.73
185.152.250.77
185.152.250.81
185.152.250.85
185.152.250.89
185.152.250.9
185.152.250.93
185.152.250.95
185.152.250.97
185.152.250.99
```
- It's only a few hundred requests each, but I am very suspicious so I will record it here and purge their IPs from Solr
- Then I see 185.187.30.14 and 185.187.30.13 making requests also, with several different "normal" user agents
- They are both apparently in France, belonging to Scalair FR hosting
- I will purge their requests from Solr too
- Now I see some other new bots I hadn't noticed before:

```
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com
Consilio (WebHare Platform 4.28.2-dev); LinkChecker)
```

- The Consilio one appears to be a university CMS
- I will add `LinkCheck`, `Consilio`, and `WebHare` to the list of DSpace bot agents and purge them from Solr stats
- The COUNTER-Robots list already has `link.?check` but for some reason DSpace didn't match that and I see hits for some of these...
- Maybe I should add `[Ll]ink.?[Cc]heck.?` to a custom list for now?
- For now I added `Turnitin` to the new bots pull request on COUNTER-Robots
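A quick way to see why DSpace might have missed those agents: assuming the patterns are applied as case-sensitive regular expressions, the published `link.?check` pattern does not match the capitalized variants, while the custom one does. A small sketch:

```python
import re

# COUNTER-Robots pattern as published (assumed to be matched case-sensitively)
counter_pattern = re.compile(r"link.?check")
# Proposed custom pattern that also covers the capitalized variants
custom_pattern = re.compile(r"[Ll]ink.?[Cc]heck.?")

agents = [
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com",
    "Consilio (WebHare Platform 4.28.2-dev); LinkChecker)",
]

for agent in agents:
    print(agent)
    print("  COUNTER-Robots pattern matches:", bool(counter_pattern.search(agent)))
    print("  custom pattern matches:", bool(custom_pattern.search(agent)))
```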
- I purged 20,000 hits from IPs and 45,000 hits from user agents
- I will revert the default "example" agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven't merged yet:
```console
$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
Jersey\/\d
MarcEdit
OgScrper
okhttp
^Pattern\/\d
ReactorNetty\/\d
sqlmap
Typhoeus
7siters
```
- Just a note that I still can't deploy the `6_x-dev-atmire-modules` branch as it fails at `ant update`:
```console
[java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
```
- I had told Atmire about this several weeks ago... but I reminded them again in the ticket
- Atmire says they are able to build fine, so I tried again and noticed that I had been building with `-Denv=dspacetest.cgiar.org`, which is not necessary for DSpace 6 of course
  - Once I removed that it builds fine
- I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!
- Run all system updates on DSpace Test (linode26), deploy latest `6_x-dev-atmire-modules` branch, and reboot it
## 2020-07-02
- I need to export some Solr statistics data from CGSpace to test Salem's modifications to the dspace-statistics-api
- He modified it to query Solr on the fly instead of indexing it, which will be heavier and slower, but allows us to get more granular stats and countries/cities
- Because we have so many records I want to use solr-import-export-json to get several months at a time with a date range, but it seems there are some issues with curl first (I need to disable globbing with `-g` and URL encode the range)
- For reference, see the Solr 4.10.x DateField docs
- This range works in the Solr UI:

```
[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]
```
- As well as in curl:

```console
$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "indent":"true",
      "fq":"time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]",
      "rows":"0",
      "wt":"json"}},
  "response":{"numFound":7784285,"start":0,"docs":[]
  }}
```
- But not in solr-import-export-json... hmmm... it seems we need to URL encode only the date range itself, but not the brackets:

```console
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
$ zstd /tmp/statistics-2019-1.json
```
- Then import it on my local dev environment:

```console
$ zstd -d statistics-2019-1.json.zst
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
```
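The escaping that worked can be reproduced with Python's `urllib.parse.quote`; note that the trailing `]` stays literal, matching the filter string that the export accepted above:

```python
from urllib.parse import quote

start = "2019-01-01T00:00:00Z"
end = "2019-06-30T23:59:59Z"

# quote() with safe="" percent-encodes '[', ':' and the space in " TO ",
# then we append the closing bracket un-encoded
inner = quote(f"[{start} TO {end}", safe="")
fq = f"time:{inner}]"
print(fq)
```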
## 2020-07-05
- Import twelve items into the CRP Livestock multimedia collection for Peter Ballantyne
- I ran the data through csv-metadata-quality first to validate and fix some common mistakes
- Interesting to check the data with `csvstat` to see if there are any duplicates
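The duplicate check that `csvstat` helps with can also be sketched in plain Python; the column name and sample rows here are hypothetical:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical sample of the upload CSV; the real file has more columns
sample = """dc.title,filename
Video one,video1.mp4
Video two,video2.mp4
Video one,video1-copy.mp4
"""

titles = [row["dc.title"] for row in csv.DictReader(StringIO(sample))]
duplicates = [title for title, count in Counter(titles).items() if count > 1]
print(duplicates)
```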
- Peter recently asked me to add Target audience (`cg.targetaudience`) to the CGSpace sidebar facets and AReS filters
  - I added it on my local DSpace test instance, but I'm waiting for him to tell me what he wants the header to be: "Audiences" or "Target audience" etc...
- Peter also asked me to increase the size of links in the CGSpace "Welcome" text
  - I suggested using the CSS `font-size: larger` property to just bump it up one size relative to what it already is
  - He said it looks good, but that actually now the links seem OK (I told him to refresh, as I had made them bold a few days ago) so we don't need to adjust it
- Mohammed Salem modified my dspace-statistics-api to query Solr directly so I started writing a script to benchmark it today
- I will monitor the JVM memory and CPU usage in visualvm, just like I did in 2019-04
- I noticed an issue with his limit parameter so I sent him some feedback on that in the meantime
- I noticed that we have 20,000 distinct values for `dc.subject`, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:
```console
dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
```
- DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:
```console
dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
```
- Note the use of the POSIX character class :)
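A rough Python equivalent of that `~ '[[:lower:]]'` filter, with made-up sample values, shows which rows the UPDATE would touch:

```python
# Select values containing at least one lowercase letter and uppercase them,
# mimicking the PostgreSQL POSIX character class match above
subjects = ["LIVESTOCK", "Climate change", "maize", "SOIL FERTILITY"]

to_fix = {s: s.upper() for s in subjects if any(c.islower() for c in s)}
print(to_fix)
```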
- I suggest that we generate a list of the top 5,000 values that don't match AGROVOC so that Sisay can correct them
- Start by getting the top 6,500 subjects (assuming that the top ~1,500 are valid from our previous work):
```console
dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-cgspace-subjects.txt
```
- Then start looking them up using `agrovoc-lookup.py`:

```console
$ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
```
## 2020-07-06
- I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the csv-metadata-quality script
- Mostly to make more efficient usage of the requests cache and to use parameterized requests instead of building the request URL by concatenating the URL with query parameters
- I modified the `agrovoc-lookup.py` script to save its results as a CSV with the subject, language, and type of match (preferred, alternate, and total number of matches) rather than saving two separate files
- Note that I see `prefLabel`, `matchedPrefLabel`, and `altLabel` in the REST API responses and I'm not sure what the second one means
  - I emailed FAO's AGROVOC contact to ask them
  - They responded to say that `matchedPrefLabel` is not a property in the SKOS/SKOSXL vocabulary, but their SKOSMOS system seems to use it to hint that the search terms matched a `prefLabel` in another language
  - I will treat the `matchedPrefLabel` values as if they were `prefLabel` values for the indicated language then
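That handling can be sketched as a small classifier over parsed search results; the result dicts below are simplified and hypothetical, not verbatim SKOSMOS API output:

```python
# Count matchedPrefLabel as a preferred-label match (in another language),
# alongside regular prefLabel and altLabel matches
def classify(results):
    preferred = sum(1 for r in results if "prefLabel" in r or "matchedPrefLabel" in r)
    alternate = sum(1 for r in results if "altLabel" in r)
    return {"preferred": preferred, "alternate": alternate, "total": len(results)}

results = [
    {"prefLabel": "maize", "lang": "en"},
    {"matchedPrefLabel": "maïs", "lang": "fr"},
    {"altLabel": "corn", "lang": "en"},
]
print(classify(results))
```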
## 2020-07-07
- Peter asked me to send him a list of sponsors on CGSpace
```console
dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
COPY 707
```
- I ran it quickly through my `csv-metadata-quality` tool and found two issues that I will correct with `fix-metadata-values.py` on CGSpace immediately:
```console
$ cat 2020-07-07-fix-sponsors.csv
dc.description.sponsorship,correct
"Ministe`re des Affaires Etrange`res et Européennes, France","Ministère des Affaires Étrangères et Européennes, France"
"Global Food Security Programme, United Kingdom","Global Food Security Programme, United Kingdom"
$ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
```
- Upload the Capacity Development July newsletter to CGSpace for Ben Hack because Abenet and Bizu usually do it, but they are currently offline due to the Internet being turned off in Ethiopia
- I implemented the Dimensions.ai badge on DSpace Test for Peter to see, as he's been asking me for awhile:
- It was easy once I figured out how to do the XSLT in the DSpace theme (need to get the DOI link and remove the "https://doi.org/" from the string)
- Actually this raised an issue that the Altmetric badges weren't enabled in our DSpace 6 branch yet because I had forgotten to copy the config
- Also, I noticed a big issue in both our DSpace 5 and DSpace 6 branches related to the `$identifier_doi` variable being defined incorrectly and thus never getting set (it has to do with DRI)
- I fixed both and now the Altmetric badge and the Dimensions badge both appear... nice
## 2020-07-08
- Generate a CSV of all the AGROVOC subjects that didn't match from the top 6500 I exported earlier this week:
```console
$ csvgrep -c 'number of matches' -r "^0$" 2020-07-05-cgspace-subjects.csv | csvcut -c 1 > 2020-07-05-cgspace-invalid-subjects.csv
```
- Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of "funny character" issues with reports generated from CGSpace
- I told her that it's probably her Windows / Excel that is messing up the data, and she figured out how to open them correctly!
- Now she says she doesn't want to remove the accents after all and she sent me a new list of corrections
- I used csvgrep and found a few where she is still removing accents:
```console
$ csvgrep -c 2 -r "^.+$" ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r "^.*[À-ú].*$" | csvgrep -c 2 -r "^.*[À-ú].*$" -i | csvcut -c 1,2
dc.contributor.author,correction
"López, G.","Lopez, G."
"Gómez, R.","Gomez, R."
"García, M.","Garcia, M."
"Mejía, A.","Mejia, A."
"Quiróz, Roberto A.","Quiroz, R."
```
- csvgrep from the csvkit suite is so cool:
  - Select lines with column two (the correction) having a value
  - Select lines with column one (the original author name) having an accent / diacritic
  - Select lines with column two (the correction) NOT having an accent (ie, she's not removing an accent)
  - Select columns one and two
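A stricter way to detect remaining accents than the `[À-ú]` character range is Unicode decomposition; a small Python sketch:

```python
import unicodedata

# Detect whether a name contains a combining accent after NFD normalization,
# an alternative to the [À-ú] range used with csvgrep above
def has_accent(name):
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", name))

print(has_accent("López, G."))  # original name, has an accent
print(has_accent("Lopez, G."))  # the correction strips it
```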
- Peter said he liked the work I did on the badges yesterday so I put some finishing touches on it to detect more DOI URI styles and pushed it to the `5_x-prod` branch
  - I will port it to DSpace 6 soon
- I wrote a quick script to lookup organizations (affiliations) in the Research Organization Repository (ROR) JSON data release v5
- I want to use this to evaluate ROR as a controlled vocabulary for CGSpace and MELSpace
- I exported a list of affiliations from CGSpace:
```console
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
```
- Then I stripped the CSV header and quotes to make it a plain text file and ran `ror-lookup.py`:

```console
$ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
1406
$ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
4462
```
- So, minus the CSV header, we have 1405 case-insensitive matches out of 5866 (23.9%)
## 2020-07-09
- Atmire responded to the ticket about DSpace 6 and Solr yesterday
- They said that the CUA issue is due to the "unmigrated" Solr records and that we should delete them
- I told them that the "unmigrated" IDs are a known issue in DSpace 6 and we should rather figure out why they are unmigrated
- I didn't see any discussion on the dspace-tech mailing list or on DSpace Jira about unmigrated IDs, so I sent a mail to the mailing list to ask
- I updated `ror-lookup.py` to check aliases and acronyms as well and now the results are better for CGSpace's affiliation list:

```console
$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
1516
$ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
4352
```
- So now our matching improves to 1515 out of 5866 (25.8%)
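The gist of the updated matching is a case-insensitive lookup across names, aliases, and acronyms; a minimal sketch with abbreviated, partly hypothetical ROR records:

```python
# Build a case-insensitive lookup of ROR names, aliases, and acronyms
# (the sample records below are abbreviated, not real ROR JSON)
ror_records = [
    {"name": "International Livestock Research Institute",
     "aliases": [], "acronyms": ["ILRI"]},
    {"name": "Wageningen University & Research",
     "aliases": ["Wageningen UR"], "acronyms": ["WUR"]},
]

lookup = {}
for record in ror_records:
    for label in [record["name"], *record["aliases"], *record["acronyms"]]:
        lookup[label.lower()] = record["name"]

def match(affiliation):
    return lookup.get(affiliation.lower())

print(match("ILRI"))
print(match("wageningen ur"))
print(match("Some Unknown Institute"))
```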
- Gabriela from CIP said that I should run the author corrections minus those that remove accent characters so I will run it on CGSpace:
```console
$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
```
- Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:
```console
$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
```
- Start a full Discovery re-index on CGSpace:
```console
$ time chrt -b 0 dspace index-discovery -b

real    94m21.413s
user    9m40.364s
sys     2m37.246s
```
- I modified `crossref-funders-lookup.py` to be case insensitive and now CGSpace's sponsors match 173 out of 534 (32.4%):

```console
$ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
$ wc -l 2020-07-09-cgspace-sponsors.txt
534 2020-07-09-cgspace-sponsors.txt
$ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
174
```
## 2020-07-12
- On 2020-07-10 Macaroni Bros emailed to ask if there are issues with CGSpace because they are getting HTTP 504 on the REST API
- First, I looked in Munin and I see a high number of DSpace sessions and threads on Friday evening around midnight, though that was much later than his email:
- CPU load and memory were not high then, but there was some load on the database and firewall...
- Looking in the nginx logs I see a few IPs we've seen recently, like those 199.47.x.x IPs from Turnitin (which I need to remember to purge from Solr again because I didn't update the spider agents on CGSpace yet) and a new one, 186.112.8.167
- Also, the Turnitin bot doesn't re-use its Tomcat JSESSIONID, I see this from today:
```console
# grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2815
```
- So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session
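One way to cover the Turnitin agent is an entry like this in Tomcat's `server.xml` (a sketch; the exact regex and interval here are assumptions, and Tomcat's default `crawlerUserAgents` pattern already covers agents containing "bot"):

```xml
<!-- Force crawlers matching these user-agent regexes to share one session
     per host, instead of creating a new session for every request -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Turnitin.*"
       sessionInactiveInterval="60"/>
```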
- There are around 9,000 requests from `186.112.8.167` in Colombia with the user agent `Java/1.8.0_241`, but those were mostly to the REST API and I don't see any hits in Solr
- Earlier in the day Linode had alerted that there was high outgoing bandwidth
- I see some new bot from 134.155.96.78 made ~10,000 requests with the user agent... but it appears to already be in our DSpace user agent list via COUNTER-Robots:
```
Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
```
- Generate a list of sponsors to update our controlled vocabulary:
```console
dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
COPY 125
dspace=# \q
$ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv > dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
# add XML formatting
$ dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
```
- Deploy latest `5_x-prod` branch on CGSpace (linode18), run all system updates, and reboot the server
  - After rebooting it I had to restart Tomcat 7 once to get all Solr statistics cores to come up properly
## 2020-07-13
- I recommended to Marie-Angelique that we use ROR for CG Core V2 (#27)
- Purge 2,700 hits from CodeObia IP addresses in CGSpace statistics... I wonder when they will figure out how to use a bot user agent
## 2020-07-14
- I ran the `dspace cleanup -v` process on CGSpace and got an error:

```console
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(189618) is still referenced from table "bundle".
```
- The solution is, as always:
```console
$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
UPDATE 1
```
- Udana from WLE asked me about some items that didn't show Altmetric donuts
- I checked his list and at least three of them actually did show donuts, and for four others I tweeted them manually to see if they would get a donut in a few hours:
## 2020-07-15
- All four IWMI items that I tweeted yesterday have Altmetric donuts with a score of 1 now...
- Export CGSpace countries to check them against ISO 3166-1 and ISO 3166-3 (historic countries):
```console
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
COPY 194
```
- I wrote a script `iso3166-lookup.py` to check them:

```console
$ ./iso3166-1-lookup.py -i /tmp/2020-07-15-countries.csv -o /tmp/2020-07-15-countries-resolved.csv
$ csvgrep -c matched -m false /tmp/2020-07-15-countries-resolved.csv
country,match type,matched
CAPE VERDE,,false
"KOREA, REPUBLIC",,false
PALESTINE,,false
"CONGO, DR",,false
COTE D'IVOIRE,,false
RUSSIA,,false
SYRIA,,false
"KOREA, DPR",,false
SWAZILAND,,false
MICRONESIA,,false
TIBET,,false
ZAIRE,,false
COCOS ISLANDS,,false
LAOS,,false
IRAN,,false
```
- Check the database for DOIs that are not in the preferred "https://doi.org/" format:
```console
dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
COPY 186
```
- Then I imported them into OpenRefine and replaced them in a new "correct" column using this GREL transform:
```
value.replace("dx.doi.org", "doi.org").replace("http://", "https://").replace("https://dx,doi,org", "https://doi.org").replace("https://doi.dx.org", "https://doi.org").replace("https://dx.doi:", "https://doi.org").replace("DOI: ", "https://doi.org/").replace("doi: ", "https://doi.org/").replace("http://dx.doi.org", "https://doi.org").replace("https://dx. doi.org. ", "https://doi.org").replace("https://dx.doi", "https://doi.org").replace("https://dx.doi:", "https://doi.org/").replace("hdl.handle.net", "doi.org")
```
- Then I fixed the DOIs on CGSpace:
```console
$ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
```
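A regex-based sketch of the same normalization (it doesn't cover every malformed variant that the GREL chain above handles, and the sample inputs are made up):

```python
import re

# Strip any http(s)://(dx.)doi.org prefix or a leading "DOI:"/"doi:" label,
# then re-attach the canonical https://doi.org/ prefix
def normalize_doi(value):
    bare = re.sub(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*|DOI:\s*)", "", value.strip())
    return f"https://doi.org/{bare}"

for doi in ["http://dx.doi.org/10.1016/j.example", "DOI: 10.5281/zenodo.1234"]:
    print(normalize_doi(doi))
```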
- I filed an issue on Debian's iso-codes project to ask why "Swaziland" does not appear in the ISO 3166-3 list of historical country names despite it being changed to "Eswatini" in 2018.
- Atmire responded about the Solr issue
- They said that it seems like a DSpace issue so that it's not their responsibility, and nobody responded to my question on the dspace-tech mailing list...
- I said I would try to do a migration on DSpace Test with more of CGSpace's Solr data to try and approximate how much of our data would be affected
- I also asked them about the Tomcat 8.5 issue with CUA as well as the CUA group name issue that I had asked originally in April
## 2020-07-20
- Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
```console
217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
```
- I still see 12,000 records in Solr from this user agent, though.
- I wonder why the DSpace bot list didn't get those... because it has "bot" which should cause Solr to not log the hit
- I purged ~30,000 hits from Solr statistics based on the IPs above, but also for some agents like Drupal (which isn't in the list yet) and OgScrper (which is as of 2020-03)
- Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
- I closed the old pull request and created a new one
- Then I updated the lists in the `5_x-prod` and 6.x branches
- I re-ran the `check-spider-hits.sh` script with the new lists and purged around 14,000 more stats hits in each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 total
- I looked at the CLARISA institutions list again, since I hadn't looked at it in over six months:
```console
$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
```
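The same extraction could be done without the jq/awk/sed chain; a Python sketch with made-up sample records:

```python
import csv
import io
import json

# Extract the "name" field from the CLARISA JSON and number the rows;
# the institution records below are made-up samples
institutions = json.loads("""[
    {"name": "Ministerio del Ambiente / Ministry of Environment (Peru)"},
    {"name": "Carthage University / Université de Carthage"}
]""")

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "name"])
for i, institution in enumerate(institutions, start=1):
    writer.writerow([i, institution["name"]])

print(out.getvalue())
```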
- The API still needs a key unless you query from Swagger web interface
- They currently have 3,469 institutions...
- Also, they still combine multiple text names into one string along with acronyms and countries:
- Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)
- Ministerio del Ambiente / Ministry of Environment (Peru)
- Carthage University / Université de Carthage
- Sweet Potato Research Institute (SPRI) of Chinese Academy of Agricultural Sciences (CAAS)
- And I checked the list with my csv-metadata-quality tool and found it still has whitespace and unnecessary Unicode characters in several records:
```console
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv
Removing excessive whitespace (name): Comitato Internazionale per lo Sviluppo dei Popoli / International Committee for the Development of Peoples
Removing excessive whitespace (name): Deutsche Landwirtschaftsgesellschaft / German agriculture society
Removing excessive whitespace (name): Institute of Arid Regions of Medenine
Replacing unnecessary Unicode (U+00AD): Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)
Removing unnecessary Unicode (U+200B): Agencia de Servicios a la Comercialización y Desarrollo de Mercados Agropecuarios
```
- I think the ROR is much better in every possible way
- Re-enabled all the yearly Solr statistics cores on DSpace Test (linode26) because they had been disabled by Atmire when they were testing on the server
- Run system updates on the server and reboot it
## 2020-07-21
- I built the latest 6.x branch on DSpace Test (linode26) and I noticed a few Font Awesome icons are missing in the Atmire CUA statlets
  - One was simple to fix by adding it to our font library in `fontawesome.js`, but there are two more that are printing hex values instead of using HTML elements:
- I had previously thought these were fixed by setting the `font-family` on the elements, but it doesn't appear to be working now
  - I filed a ticket with Atmire to ask them to use the HTML elements instead, as their code already uses those elsewhere
- I don't want to go back to using the large webfonts with CSS because the SVG + JS method saves us ~140KiB and causes at least three fewer network requests
- I started processing the 2019 stats in a batch of 1 million on DSpace Test:
```console
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
...
*** Statistics Records with Legacy Id ***

 6,359,966 Bistream View
 2,204,775 Item View
   139,266 Community View
   131,234 Collection View
   948,529 Community Search
   593,974 Collection Search
 1,682,818 Unexpected Type & Full Site
--------------------------------------
12,060,562 TOTAL
```
- The statistics-2019 finished processing after about 9 hours so I started the 2018 ones:
```console
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018

*** Statistics Records with Legacy Id ***

 3,684,394 Bistream View
 2,183,032 Item View
   131,222 Community View
    79,348 Collection View
   345,529 Collection Search
   322,223 Community Search
   874,107 Unexpected Type & Full Site
--------------------------------------
 7,619,855 TOTAL
```
- Moayad finally made OpenRXV use a unique user agent:

```
OpenRXV harvesting bot; https://github.com/ilri/OpenRXV
```

- I see nearly 200,000 hits in Solr from the IP address, though, so I need to make sure those are old ones from before today
  - I purged the hits for 178.62.93.141 as well as any from the old `axios/0.19.2` user agent
  - I made some requests with and without the new user agent and only the ones without showed up in Solr
## 2020-07-22
- Atmire merged my latest bot suggestions to the COUNTER-Robots project:
- I will update the agent patterns on the CGSpace `5_x-prod` and 6.x branches
- Make some changes to the Bootstrap CSS and HTML configuration to improve readability and style on the CG Core v2 metadata reference guide and send a pull request to Marie (#29)
- The `solr-upgrade-statistics-6x` tool keeps crashing due to memory issues when processing the 2018 stats
  - I reduced the number of records per batch from 10,000 to 5,000 and increased the memory to 3072 and it still crashes...
  - I reduced the number of records per batch to 1,000 and it works, but it still took like twenty minutes before it even started!
  - Eventually, after processing a few million records, it crashed with this error:
```console
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
```
- There were four records so I deleted them:
```console
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:10</query></delete>'
```
- Meeting with Moayad and Peter and Abenet to discuss the latest AReS changes
## 2020-07-23
- I closed all issues in the OpenRXV and AReS GitHub repositories with screenshots so that Moayad can use them for his invoice
- The statistics-2018 core always crashes with the same error even after I deleted the "id:10" records...
- I started the statistics-2017 core and it finished in 3:44:15
- I started the statistics-2016 core and it finished in 2:27:08
- I started the statistics-2015 core and it finished in 1:07:38
- I started the statistics-2014 core and it finished in 1:45:44
- I started the statistics-2013 core and it finished in 1:41:50
- I started the statistics-2012 core and it finished in 1:23:36
- I started the statistics-2011 core and it finished in 0:39:37
- I started the statistics-2010 core and it finished in 0:01:46
## 2020-07-24
- Looking at the statistics-2019 Solr stats, I see some interesting user agents and IPs
- For example, I see 568,000 requests from 66.109.27.x in 2019-10, all with the same exact user agent:

```
Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
```
- Also, in the same month with the same exact user agent, I see 300,000 from 192.157.89.x
- The 66.109.27.x IPs belong to galaxyvisions.com
- The 192.157.89.x IPs belong to cologuard.com
- All these hosts were reported in late 2019 on abuseipdb.com
- Then I see another one, 163.172.71.23, that made 215,000 requests in 2019-09 and 2019-08
  - It belongs to poneytelecom.eu and is also on abuseipdb.com for PHP injection and directory traversal
  - It uses this user agent:

```
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
```
- In statistics-2018 I see more weird IPs
  - 54.214.112.202 made 839,000 requests with no user agent...
    - It is on Amazon Web Services (AWS) and made 100% `statistics_type:view`, so I guess it was harvesting via the REST API
  - A few IPs owned by perfectip.net made 400,000 requests in 2018-01
    - They are 2607:fa98:40:9:26b6:fdff:feff:195d and 2607:fa98:40:9:26b6:fdff:feff:1888 and 2607:fa98:40:9:26b6:fdff:feff:1c96 and 70.36.107.49
    - All the requests used this user agent:

```
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
```
- Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it's definitely CodeObia / ICARDA and I will purge them
- Jesus, and then there are 100,000 from the ILRI harvester on Linode on 2a01:7e00::f03c:91ff:fe0a:d645
- Jesus fuck there is 46.101.86.248 making 15,000 requests per month in 2018 with no user agent...
- Jesus fuck there is 84.38.130.177 in Latvia that was making 75,000 requests in 2018-11 and 2018-10
- Jesus fuck there is 104.198.9.108 on Google Cloud that was making 30,000 requests with no user agent
- I will purge the hits from all the following IPs:

```
192.157.89.4
192.157.89.5
192.157.89.6
192.157.89.7
66.109.27.142
66.109.27.139
66.109.27.138
66.109.27.140
66.109.27.141
2607:fa98:40:9:26b6:fdff:feff:1888
2607:fa98:40:9:26b6:fdff:feff:195d
2607:fa98:40:9:26b6:fdff:feff:1c96
213.139.53.62
2a01:7e00::f03c:91ff:fe0a:d645
46.101.86.248
54.214.112.202
84.38.130.177
104.198.9.108
70.36.107.49
```
- In total these accounted for the following amount of requests in each year:
- 2020: 1436
- 2019: 960274
- 2018: 1588149
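A delete-by-query per IP against each yearly statistics core is enough to purge these. A rough sketch (the update endpoint matches the one used for the id:10 deletions above; the IP list here is abbreviated):

```shell
# Build one Solr delete-by-query covering several abusive IPs (sketch;
# in practice this would cover all the IPs above and every yearly core).
# Note: IPv6 values contain colons, so they would need escaping or quoting.
delete='<delete>'
for ip in 192.157.89.4 66.109.27.142 54.214.112.202; do
    delete="${delete}<query>ip:${ip}</query>"
done
delete="${delete}</delete>"
echo "$delete"

# Then POST it to a core, for example:
# curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" \
#     -H "Content-Type: text/xml" --data-binary "$delete"
```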
- I noticed a few other user agents that should be purged too:

```
^Java\/\d{1,2}.\d
FlipboardProxy\/\d
API scraper
RebelMouse\/\d
Iframely\/\d
Python\/\d
Ruby
NING\/\d
ubermetrics-technologies\.com
Jetty\/\d
scalaj-http\/\d
mailto\:team@impactstory\.org
```
- I purged them from the stats too:
- 2020: 19553
- 2019: 29745
- 2018: 18083
- 2017: 19399
- 2016: 16283
- 2015: 16659
- 2014: 713
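The same delete-by-query mechanism works for user agents, using Solr regex queries on the userAgent field. A sketch with simplified patterns (the PCRE-style `\d` patterns above would need translating to Lucene regex syntax first):

```shell
# Delete hits whose user agent matches a (simplified) bot pattern (sketch;
# in practice one query per pattern against each yearly statistics core)
for ua in Jetty FlipboardProxy scalaj-http; do
    delete="<delete><query>userAgent:/${ua}.*/</query></delete>"
    echo "$delete"
    # curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" \
    #     -H "Content-Type: text/xml" --data-binary "$delete"
done
```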
## 2020-07-26
- I continued with the Solr ID to UUID migrations (solr-upgrade-statistics-6x) from last week and updated my notes for each core above
- After all cores finished migrating I optimized them to delete old documents
- Export some of the CGSpace Solr stats, minus the Atmire CUA schema additions, for Salem to play with:

```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
```
- Run system updates on DSpace Test (linode26) and reboot it
- I looked into the unmigrated Solr records more and they are overwhelmingly `type: 5` (which means "Site" according to the DSpace constants):
  - statistics
    - id: -1-unmigrated
      - type 5: 167316
    - id: 0-unmigrated
      - type 5: 32581
    - id: -1
      - type 5: 10198
  - statistics-2019
    - id: -1
      - type 5: 2690500
    - id: -1-unmigrated
      - type 5: 1348202
    - id: 0-unmigrated
      - type 5: 141576
  - statistics-2018
    - id: -1
      - type 5: 365466
    - id: -1-unmigrated
      - type 5: 254680
    - id: 0-unmigrated
      - type 5: 204854
    - 145870-unmigrated
      - type 0: 83235
  - statistics-2017
    - id: -1
      - type 5: 808346
    - id: -1-unmigrated
      - type 5: 598022
    - id: 0-unmigrated
      - type 5: 254014
    - 145870-unmigrated
      - type 0: 28168
      - bundleName THUMBNAIL: 28168
- There is another id that appears in 2018 and 2017 at least, of type 0, which would be a download
  - In that case the id is of a bitstream that no longer exists...?
- I started processing Solr stats with the Atmire tool now:

```console
$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
```
- This one failed after a few hours:

```
Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
    at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
    at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
    at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
    at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
    at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.NullPointerException

Run 2 — 100% — 2,237,670/2,239,523 docs — 12s — 2h 25m 41s
Run 2 took 2h 25m 41s
179,310 docs failed to process
If run the update again with the resume option (-r) they will be reattempted
```
- I started the same script for the statistics-2019 core (12 million records...)
- Update an ILRI author's name on CGSpace:

```console
$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
Fixed 13 occurences of: Muloi, D.
Fixed 4 occurences of: Muloi, D.M.
```
## 2020-07-28
- I started analyzing the situation with the cases I've seen where a Solr record fails to be migrated:
  - `id: 0-unmigrated` are mostly (all?) `type: 5`, aka site view
  - `id: -1-unmigrated` are mostly (all?) `type: 5`, aka site view
  - `id: -1` are mostly (all?) `type: 5`, aka site view
  - `id: 59184-unmigrated`, where "59184" is the id of an item or bitstream that no longer exists
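These per-id counts can be tallied with a facet query on the `type` field; a sketch using the same local Solr URL as elsewhere in these notes (the `id:/.*-unmigrated/` regex query is my construction, not a command copied from a session):

```shell
# Count leftover unmigrated documents in a core, faceted by type (sketch)
core="statistics-2018"
params="q=id:/.*-unmigrated/&rows=0&facet=true&facet.field=type"
echo "http://localhost:8081/solr/${core}/select?${params}"

# curl -s "http://localhost:8081/solr/${core}/select" -d "$params"
```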
- Why doesn't Atmire's code ignore any id with "-unmigrated"?
- I sent feedback to Atmire since they had responded to my previous question yesterday
- They said that the DSpace 6 version of CUA does not work with Tomcat 8.5...
- I spent a few hours trying to write a Jython-based curation task to update ISO 3166-1 Alpha2 country codes based on each item's ISO 3166-1 country
- Peter doesn't want to use the ISO 3166-1 list because he objects to a few names, so I thought we might be able to use country codes or numeric codes and update the names with a curation task
- The work is very rough but kinda works: mytask.py
- What is nice is that the `dso.update()` method updates the data the "DSpace way", so we don't need to re-index Solr
- I had a clever idea to "vendor" the pycountry code using `pip install pycountry -t`, but pycountry dropped support for Python 2 in 2019 so we can only use an outdated version
- In the end it's really limiting to do this particular task in Jython because we are stuck with Python 2, we can't use virtual environments, and there is a lot of code we'd need to write to be able to handle the ISO 3166 country lists
- Python 2 is no longer supported by the Python community anyways so it's probably better to figure out how to do this in Java
## 2020-07-29
- The Atmire stats tool (`com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI`) created 150GB of log files due to errors and the disk got full on DSpace Test (linode26)
- This morning I had noticed that the run I started last night said that 54,000,000 (54 million!) records failed to process, but the core only had 6 million or so documents to process...!
- I removed the large log files and optimized the Solr core
## 2020-07-30
- Looking into ISO 3166-1 from the iso-codes package
- I see that all current 249 countries have names, 173 have official names, and 6 have common names:

```console
# grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '"official_name":' /usr/share/iso-codes/json/iso_3166-1.json
173
# grep -c -E '"common_name":' /usr/share/iso-codes/json/iso_3166-1.json
6
```
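If we end up reading iso_3166-1.json directly rather than vendoring pycountry, the lookup we need is small. A minimal Python sketch, using an inline sample that mimics the file's structure (the real file has 249 entries under its "3166-1" key):

```python
import json

# Inline sample mimicking /usr/share/iso-codes/json/iso_3166-1.json
sample = """
{
  "3166-1": [
    {"alpha_2": "KE", "alpha_3": "KEN", "name": "Kenya", "numeric": "404"},
    {"alpha_2": "BO", "alpha_3": "BOL", "name": "Bolivia, Plurinational State of",
     "numeric": "068", "official_name": "Plurinational State of Bolivia",
     "common_name": "Bolivia"}
  ]
}
"""

def load_country_map(data):
    """Map ISO 3166-1 alpha-2 codes to display names, preferring
    common_name over name when present."""
    countries = {}
    for entry in json.loads(data)["3166-1"]:
        countries[entry["alpha_2"]] = entry.get("common_name", entry["name"])
    return countries

countries = load_country_map(sample)
print(countries["KE"])  # Kenya
print(countries["BO"])  # Bolivia
```

This also works under Python 2, so in principle the same approach could run inside a Jython curation task.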
- Wow, the `CC-BY-NC-ND-3.0-IGO` license that I had requested in 2019-02 was finally merged into SPDX...