- Troubleshooting the missing Altmetric scores on AReS
- Turns out that I didn't actually fix them last month because the check for `content.altmetric` still exists, and I can't access the DOIs using `_h.source.DOI` for some reason
- I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
- I will change `DOI` to `tomato` in the repository setup and start a re-harvest... I need to see if this is some kind of reserved word or something...
- Checking last month's Solr statistics to see if there are any new bots that I need to purge and add to the list
- made 50,000 requests on one day in August, and it is using this user agent: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
- It's a fixed line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser
- is in Sweden and made 46,000 requests in August and it is using this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- is on Amazon and made 28,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
- is in Sweden and made 9,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- is in Germany and made 6,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/`
- is on Amazon and made 3,000 requests with this user agent: `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
- I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.
- I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
- While looking at the MSN requests I noticed tons of requests from other strange hosts, which I identified by their reverse DNS: malta2095.startdedicated.com., astra5139.startdedicated.com., and many others
- They must be related, because I see them all using the exact same user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- So this startdedicated.com DNS is some Bing bot also...
- I extracted all the IPs and purged them using my `check-spider-ip-hits.sh` script
- In total I purged 225,000 hits...
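The reverse DNS grouping described above can be sketched in Python. This is only an illustration, not the `check-spider-ip-hits.sh` script: it assumes the hostnames have already been resolved (for example with `socket.gethostbyaddr`), the names `BOT_SUFFIXES`, `classify_host`, and `bot_ips` are hypothetical, and the IPs are placeholders from the documentation range rather than addresses from the logs.

```python
# Reverse DNS suffixes taken from the notes above; the labels are made up
# for illustration only.
BOT_SUFFIXES = {
    ".search.msn.com": "msnbot",
    ".startdedicated.com": "startdedicated",
}


def classify_host(hostname):
    """Return a label if the reverse DNS name ends with a known suffix, else None."""
    # Reverse DNS names often carry a trailing dot, so strip it first.
    name = hostname.rstrip(".").lower()
    for suffix, label in BOT_SUFFIXES.items():
        if name.endswith(suffix):
            return label
    return None


def bot_ips(resolved):
    """Given a mapping of IP to hostname, return the sorted IPs that look like bots."""
    return sorted(ip for ip, host in resolved.items() if classify_host(host))


# Placeholder IPs (192.0.2.0/24 is reserved for documentation).
resolved = {
    "192.0.2.1": "msnbot-40-77-167-105.search.msn.com.",
    "192.0.2.2": "malta2095.startdedicated.com.",
    "192.0.2.3": "example.org.",
}
print(bot_ips(resolved))
```

The resulting list of IPs is the kind of input a purge script would then take, one address per line.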
## 2021-09-12
- Start a harvest on AReS
## 2021-09-13
- Mishell Portilla asked me about thumbnails on CGSpace being small
- For example, [10568/114576](https://cgspace.cgiar.org/handle/10568/114576) has a lot of white space on the left side
- Start writing a Python script to parse `input-forms.xml` to create documentation for submissions
- Found a bug in the DSpace 6.3 REST API: it returns HTTP 500 for `dc.title` even though the field exists in the registry: https://demo.dspace.org/rest/registries/schema/dc/metadata-fields/title
- Seems to be with any field that does not have a qualifier
- I filed an issue: https://github.com/DSpace/DSpace/issues/7946
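A minimal sketch of what parsing `input-forms.xml` for documentation might look like, using only the standard library. The inline sample assumes the usual DSpace 6 layout (`field` elements with `dc-schema`, `dc-element`, `dc-qualifier`, `label`, and `required` children), and `describe_fields` is a hypothetical helper, not the script mentioned above. Note how a field without a qualifier, like `dc.title`, has to be handled explicitly.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the assumed shape of a DSpace 6 input-forms.xml.
SAMPLE = """<input-forms>
  <form-definitions>
    <form name="traditional">
      <page number="1">
        <field>
          <dc-schema>dc</dc-schema>
          <dc-element>title</dc-element>
          <dc-qualifier></dc-qualifier>
          <label>Title</label>
          <required>You must enter a title.</required>
        </field>
        <field>
          <dc-schema>cg</dc-schema>
          <dc-element>coverage</dc-element>
          <dc-qualifier>subregion</dc-qualifier>
          <label>Subregion</label>
          <required></required>
        </field>
      </page>
    </form>
  </form-definitions>
</input-forms>"""


def describe_fields(xml_text):
    """Return one dict per <field> with its dotted name, label, and required flag."""
    root = ET.fromstring(xml_text)
    fields = []
    for field in root.iter("field"):
        parts = [field.findtext("dc-schema"), field.findtext("dc-element")]
        # An empty qualifier means the field is unqualified (e.g. dc.title).
        qualifier = (field.findtext("dc-qualifier") or "").strip()
        if qualifier:
            parts.append(qualifier)
        fields.append({
            "name": ".".join(parts),
            "label": field.findtext("label", default=""),
            # A non-empty <required> message marks the field as required.
            "required": bool((field.findtext("required") or "").strip()),
        })
    return fields


for entry in describe_fields(SAMPLE):
    print(entry["name"], "-", entry["label"], "(required)" if entry["required"] else "")
```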
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
- I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
- It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decommissioned the old one last week
- I updated the configuration on CGSpace and confirmed that it is working
- Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
- I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API
- Run `dspace cleanup -v` process on CGSpace to clean up old bitstreams
- Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
COPY 80901
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
COPY 1274
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
- Peter sent me back the corrections for the affiliations
- It is about 1,280 corrections and fourteen deletions
- I cleaned them up in csv-metadata-quality and then extracted the deletes and fixes to separate files to run with `fix-metadata-values.py` and `delete-metadata-values.py`:
- Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
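For comparison, the `csvcut -c 1 ... | sed 1d` step can be approximated with Python's `csv` module. This is a rough equivalent under stated assumptions: `first_column_values` is a hypothetical helper and the sample rows are invented. One difference is that it yields the raw values, whereas `csvcut` re-quotes values that contain commas.

```python
import csv
import io


def first_column_values(csv_file):
    """Return the first column of a CSV file object, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # drop the header, like `sed 1d`
    return [row[0] for row in reader if row]


# Invented sample data in the shape of the exported affiliations CSV.
sample = io.StringIO(
    "cg.contributor.affiliation,count\n"
    '"Example University, Department A",10\n'
    "Example Institute,5\n"
)
print("\n".join(first_column_values(sample)))
```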
- Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too
- Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne
- Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:
localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
- Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
- It seems that these fields are all still using UPPER CASE:
- cg.subject.alliancebiovciat
- cg.species.breed
- cg.subject.bioversity
- cg.subject.ccafs
- cg.subject.ciat
- cg.subject.cip
- cg.identifier.iitatheme
- cg.subject.iita
- cg.subject.ilri
- cg.subject.pabra
- cg.river.basin
- cg.coverage.subregion (done)
- dcterms.audience (done)
- cg.subject.wle
- We can do some of these without even asking anyone, for example `cg.coverage.subregion`, `cg.river.basin`, and `dcterms.audience`
- First, I will look at `cg.coverage.subregion`
- These should ideally come from ISO 3166-2 subdivisions
- I will title case them with `INITCAP` and then create a controlled vocabulary from those that match (and worry about cleaning the rest up later)
localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.coverage.subregion" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
COPY 1200
- Then I process the list for matches with my `subdivision-lookup.py` script, and extract only the values that matched:
- Then I updated the controlled vocabulary in the submission forms
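The matching step can be illustrated roughly as follows. This is not the `subdivision-lookup.py` script: `split_matches` is a hypothetical helper, and `known_names` is a stand-in for a real ISO 3166-2 name list (the sample names are illustrative and have not been checked against the standard).

```python
def split_matches(values, known_names):
    """Split values into those matching a known name (case-insensitively) and the rest."""
    # Map a case-folded key to the canonical spelling.
    lookup = {name.casefold(): name for name in known_names}
    matched, unmatched = [], []
    for value in values:
        key = value.strip().casefold()
        if key in lookup:
            matched.append(lookup[key])
        else:
            unmatched.append(value)
    return matched, unmatched


# Illustrative stand-in for a real subdivision list.
known_names = ["Oromia", "Nairobi"]
matched, unmatched = split_matches(["Oromia", "East Africa", "nairobi"], known_names)
print(matched)
print(unmatched)
```

Only the matched values would go into the controlled vocabulary; the unmatched ones are left for later cleanup.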
- I did the same for `dcterms.audience`, taking special care with a few all-caps values:
localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
localhost/dspace63= > UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
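The same logic can be sketched in Python to preview the effect before running the updates. This is an approximation of PostgreSQL's `INITCAP`, which upper-cases the first letter of each alphanumeric word and lower-cases the rest. The helper names and the sample value `POLICY MAKERS` are hypothetical; the two exceptions mirror the queries above.

```python
import re

# Values that must not be blindly title-cased, mirroring the SQL above.
EXCEPTIONS = {"NGOS": "NGOs", "CGIAR": "CGIAR"}


def initcap(text):
    """Approximate INITCAP: capitalize each run of alphanumeric characters."""
    return re.sub(
        r"[^\W_]+",
        lambda m: m.group(0)[0].upper() + m.group(0)[1:].lower(),
        text,
    )


def normalize(value):
    """Apply the exception table first, then fall back to title casing."""
    if value in EXCEPTIONS:
        return EXCEPTIONS[value]
    return initcap(value)


for value in ["POLICY MAKERS", "NGOS", "CGIAR"]:
    print(value, "->", normalize(value))
```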