user	7m59.182s
sys	2m22.713s
```

## 2020-10-18

- Macaroni Bros wrote to me to ask why some of their CCAFS harvesting is failing
- They are scraping HTML from /browse responses like this:

```
https://cgspace.cgiar.org/browse?type=crpsubject&value=Climate+Change%2C+Agriculture+and+Food+Security&XML&rpp=5000
```

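For reference, extracting item links from a /browse page like that can be done with Python's standard `html.parser` alone (a sketch; the sample HTML below is made up, not a real CGSpace response):

```python
from html.parser import HTMLParser

class HandleLinkParser(HTMLParser):
    """Collect hrefs that look like DSpace item handle URLs."""
    def __init__(self):
        super().__init__()
        self.handles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/handle/" in href:
                self.handles.append(href)

# Hypothetical fragment of a /browse response
sample = '<a href="/handle/10568/12345">An item</a><a href="/about">About</a>'
parser = HandleLinkParser()
parser.feed(sample)
print(parser.handles)  # ['/handle/10568/12345']
```
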
- They are using the user agent "CCAFS Website Publications importer BOT" so they are getting rate limited by nginx
- Ideally they would use the REST `find-by-metadata-field` endpoint, but it is *really* slow for large result sets (like twenty minutes!):

```
$ curl -f -A "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
```

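The POST body is just a key/value/language triple, so it can be generated with `json.dumps` to sidestep shell quoting mistakes (a sketch, reusing the same field values as the curl command above):

```python
import json

# Same metadata field and value as in the curl example above
payload = json.dumps({
    "key": "cg.contributor.crp",
    "value": "Climate Change, Agriculture and Food Security",
    "language": "en_US",
})
print(payload)
```
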
- For now I will whitelist their user agent so that they can continue scraping /browse
- I figured out that the mappings for AReS are stored in Elasticsearch
  - There is a Kibana interface running on port 5601 that can help explore the values in the index
- I can interact with Elasticsearch by sending requests, for example to delete an item by its `_id`:

```
$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "_id": "64j_THMBiwiQ-PKfCSlI"
    }
  }
}'
```

- I added a new find/replace:

```
$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "find": "ALAN1",
  "replace": "ALAN2"
}
'
```

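Worth remembering when crafting these documents by hand: the body must be strictly valid JSON, because Elasticsearch rejects things like trailing commas. A quick sanity check with Python's `json` module:

```python
import json

# A trailing comma after the last key/value pair makes the document invalid
good = '{ "find": "ALAN1", "replace": "ALAN2" }'
bad = '{ "find": "ALAN1", "replace": "ALAN2", }'

print(json.loads(good)["replace"])  # ALAN2
try:
    json.loads(bad)
except json.JSONDecodeError:
    print("rejected: trailing comma")
```
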
- I see it in Kibana, and I can search it in Elasticsearch, but I don't see it in OpenRXV's mapping values dashboard
- Now I deleted everything in the `openrxv-values` index:

```
$ curl -XDELETE http://localhost:9200/openrxv-values
```

- Then I tried posting it again:

```
$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "find": "ALAN1",
  "replace": "ALAN2"
}
'
```

- But I still don't see it in AReS
- Interesting! I added a find/replace manually in AReS and now I see the one I POSTed...
- I fixed a few bugs in the Simple and Extended PDF reports on AReS
  - Add missing ISI Journal and Type to Simple PDF report
  - Fix DOIs in Simple PDF report
  - Add missing "https://hdl.handle.net" to Handles in Extended PDF report
- Testing Atmire CUA and L&R based on their feedback from a few days ago
  - I no longer get the NullPointerException from CUA when importing metadata on the command line (!)
  - Listings and Reports now shows results for simple queries that I tested (!), though it seems that there are some new JavaScript libraries I need to allow in nginx
- I sent a mail to the dspace-tech mailing list asking about the error with DSpace 6's "Export Search Metadata" function
  - If I search for an author like "Orth, Alan" it gives an HTTP 400, but if I search for "Orth" alone it exports a CSV
  - I replicated the same issue on demo.dspace.org

## 2020-10-19

- Last night I learned how to POST mappings to Elasticsearch for AReS:

```
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
```

- The JSON file looks like this, with one instruction on each line:

```
{"index":{}}
{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
{"index":{}}
{ "find": "FISH", "replace": "Fish" }
```

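Since a malformed bulk payload fails silently in confusing ways, a small sanity checker for the alternating action/source line format can help (a sketch; the `validate_bulk_lines` helper is mine, not part of OpenRXV or Elasticsearch):

```python
import json

def validate_bulk_lines(text: str) -> int:
    """Check that a Bulk API payload alternates {"index":{}} action lines
    with find/replace source documents; return the document count."""
    lines = [line for line in text.splitlines() if line.strip()]
    assert len(lines) % 2 == 0, "bulk payload must pair action and source lines"
    docs = 0
    for action, source in zip(lines[::2], lines[1::2]):
        assert "index" in json.loads(action)
        doc = json.loads(source)
        assert "find" in doc and "replace" in doc
        docs += 1
    return docs

payload = '''{"index":{}}
{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
{"index":{}}
{ "find": "FISH", "replace": "Fish" }'''
print(validate_bulk_lines(payload))  # 2
```
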
- Adjust the report templates on AReS based on some of Peter's feedback
- I wrote a quick Python script to filter and convert the old AReS mappings to [Elasticsearch's Bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) format:

```python
#!/usr/bin/env python3

import json
import re

# Skip replacements that end in an acronym, like:
#
#   International Livestock Research Institute - ILRI
acronym_pattern = re.compile(r"[A-Z]+$")

with open('/tmp/mapping.json', 'r') as f:
    data = json.load(f)

# Iterate over the old mapping file, which is in "find": "replace" format, ie:
#
#   "alan": "ALAN"
#
# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
#
#   { "find": "alan", "replace": "ALAN" }
for find, replace in data.items():
    # Skip all upper and all lower case strings because they are indicative of
    # some AGROVOC or other mappings we no longer want to do
    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
        continue

    if acronym_pattern.search(replace) is not None:
        continue

    mapping = {"find": find, "replace": replace}

    # Print bulk action line and document for Elasticsearch
    print('{"index":{}}')
    print(json.dumps(mapping))
```

- It filters all upper and lower case strings as well as any replacements that end in an acronym like "- ILRI", reducing the number of mappings from around 4,000 to about 900
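The two filters can be exercised in isolation (the sample strings below are illustrative, not taken from the real mapping file):

```python
import re

acronym_pattern = re.compile(r"[A-Z]+$")

def keep(find: str, replace: str) -> bool:
    # Drop all-upper/all-lower strings (AGROVOC-style mappings)...
    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
        return False
    # ...and replacements that end in an acronym like "- ILRI"
    return acronym_pattern.search(replace) is None

print(keep("CRP on Dryland Systems - DS", "Dryland Systems"))  # True
print(keep("FISH", "Fish"))  # False: find is all upper case
print(keep("Intl Livestock Research Inst.",
           "International Livestock Research Institute - ILRI"))  # False: acronym
```
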
- I deleted the existing `openrxv-values` Elasticsearch index and then POSTed the new mappings:

```
$ ./convert-mapping.py > /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
```

- Then in AReS I didn't see the mappings in the dashboard until I added a new one manually, after which they all appeared
- I started a new harvesting
- I checked the CIMMYT DSpace repository and I see they have [the REST API enabled](https://repository.cimmyt.org/rest)
  - The data doesn't look too bad, actually: they have countries in title case, AGROVOC in upper case, CRPs, etc.
  - According to [their OAI](https://repository.cimmyt.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc) they have 6,500 items in the repository
  - I would be interested to explore the possibility to harvest them...
- Bosede said they were having problems with the "Access" step during item submission
  - I looked at the Munin graphs for PostgreSQL and both connections and locks look normal so I'm not sure what it could be
  - I restarted the PostgreSQL service just to see if that would help
- I ran the `dspace cleanup -v` process on CGSpace and got an error:

```
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
```

- The solution is, as always:

```
$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
UPDATE 1
```

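When cleanup fails on several bitstreams, the IDs can be pulled straight out of the error text to build that same UPDATE (a sketch; the parsing helper is mine, not part of DSpace):

```python
import re

# Error detail line as printed by `dspace cleanup -v` above
error = 'Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".'

# Extract every offending bitstream ID and build the UPDATE for psql
ids = re.findall(r"\(bitstream_id\)=\((\d+)\)", error)
sql = ("update bundle set primary_bitstream_id=NULL "
       f"where primary_bitstream_id in ({', '.join(ids)});")
print(sql)
```
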
- After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:

```
$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p

Purging 2474 hits from ShortLinkTranslate in statistics
Purging 2568 hits from RI\/1\.0 in statistics
Purging 1851 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 1282 hits from curl in statistics

Total number of bot hits purged: 8174
```

<!-- vim: set sw=2 ts=2: -->