mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-10-13 and 2020-10-14
@ -291,4 +291,72 @@ Purging 920 hits from Scoop\.it in statistics-2018
Total number of bot hits purged: 3684
```

## 2020-10-13

- Skype with Peter about AReS again
- We decided to use Title Case for our countries on CGSpace to minimize the need for mapping on AReS
- We did some work to add a dozen more mappings for strange and incorrect CRPs on AReS
- I can update the country metadata in PostgreSQL like this:

```
dspace=> BEGIN;
dspace=> UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=> COMMIT;
```
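
- A minimal Python sketch of what `INITCAP` does to multi-word names, assuming it capitalizes the first letter of every alphanumeric run and lowercases the rest; the `initcap` helper and the sample values are hypothetical and only approximate PostgreSQL's behavior:

```python
import re


def initcap(text: str) -> str:
    """Approximate PostgreSQL's INITCAP (an assumption, not its actual code)."""
    # Treat every run of letters/digits as a word: uppercase its first
    # character and lowercase the rest, leaving separators untouched.
    return re.sub(r"[^\W_]+", lambda m: m.group(0).capitalize(), text)


# Hypothetical sample values, not taken from the database
for country in ["KENYA", "COTE D'IVOIRE", "BOSNIA AND HERZEGOVINA"]:
    print(f"{country} -> {initcap(country)}")
```

- In this sketch single-word names come out fine, but particles such as "d'" and "and" get capitalized as well, so names containing them would need manual correction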

- I will need to pay special attention to Côte d'Ivoire, Bosnia and Herzegovina, and a few others though, so it may be better to do a search and replace using `fix-metadata-values.py`
- Export a list of distinct values from the database:

```
dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
```

- Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: `value.toTitlecase()`
- I still had to double-check everything to catch some corner cases (Andorra, Timor-leste, etc.)
- For the input forms I found out how to do a complicated search and replace in vim:

```
:'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
```

- It uses a [negative lookahead](https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/) (one kind of "lookaround" in PCRE terms) to match words that are *not* "pair", "displayed", etc., because we don't want to edit the XML tags themselves...
- I had to fix a few manually after doing this, as above with PostgreSQL
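
- The same negative lookahead idea can be sketched in Python, assuming the semantics of Python's `re` module rather than vim's (`\w` and `\b` are Unicode-aware in Python, so results can differ for accented letters); `PROTECTED`, `WORD`, `title_case_line`, and the sample line are hypothetical:

```python
import re

# Words that must be left alone (XML tag parts plus "AND"), as in the vim command
PROTECTED = ("pair", "displayed", "stored", "value", "AND")

# At a word start, refuse to match if a protected string follows; otherwise
# capture the first character and the remainder of the word separately.
WORD = re.compile(r"\b(?!(?:" + "|".join(PROTECTED) + r"))(\w)(\w*)")


def title_case_line(line: str) -> str:
    """Uppercase the first letter of each unprotected word, lowercase the rest."""
    return WORD.sub(lambda m: m.group(1).upper() + m.group(2).lower(), line)


# Hypothetical input-forms style line
sample = (
    "<pair><displayed-value>BOSNIA AND HERZEGOVINA</displayed-value>"
    "<stored-value>KENYA</stored-value></pair>"
)
print(title_case_line(sample))
print(title_case_line("ANDORRA"))
```

- Because the lookahead in this sketch has no trailing word boundary, a word that merely starts with a protected string (for example `ANDORRA`, which starts with `AND`) is skipped as well, which is one way corner cases can slip through and need manual fixing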

## 2020-10-14

- I discussed the title casing of countries with Abenet and she suggested we also apply title casing to regions
- I exported the list of regions from the database:

```
dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
```

- I did the same as the countries in OpenRefine for the database values and in vim for the input forms
- After testing the replacements locally I ran them on CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
```

- Then I started a full re-indexing:

```
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real	88m21.678s
user	7m59.182s
sys	2m22.713s
```

- I added a dozen or so more mappings to fix some country outliers on AReS
- I will start a fresh harvest there once the Discovery update is done on CGSpace
- I also adjusted my `fix-metadata-values.py` and `delete-metadata-values.py` scripts to work on DSpace 6, where there is no more `resource_type_id` field
- I will need to do the same for a few more scripts, but I'll do that after we migrate to DSpace 6 because those scripts are less important
- I found a new setting in DSpace 6's `usage-statistics.cfg` for case-insensitive matching of bots that defaults to false, so I enabled it in our DSpace 6 branch
- I am curious to see if that resolves the strange issues I noticed yesterday, where matching of bot patterns from the spider agents file was not working at all
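
- To illustrate why case sensitivity matters for bot matching, a minimal Python sketch; the `is_bot` helper, the pattern list, and the user agent string are hypothetical, and this is not how DSpace implements its spider detection:

```python
import re

# Hypothetical regular expressions, in the style of a spider agents file
SPIDER_PATTERNS = [r"scoop\.it", r"crawler"]


def is_bot(user_agent: str, case_insensitive: bool = False) -> bool:
    """Return True if any pattern matches somewhere in the user agent."""
    flags = re.IGNORECASE if case_insensitive else 0
    return any(re.search(p, user_agent, flags) for p in SPIDER_PATTERNS)


agent = "Mozilla/5.0 (compatible; Scoop.it)"  # hypothetical user agent
print(is_bot(agent))                         # case-sensitive matching
print(is_bot(agent, case_insensitive=True))  # case-insensitive matching
```

- In this sketch the lowercase pattern misses the capitalized `Scoop.it` agent under case-sensitive matching, while ignoring case catches it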

<!-- vim: set sw=2 ts=2: -->