mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2020-07-09
@@ -345,15 +345,70 @@ dc.contributor.author,correction
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
```

- Then I stripped the CSV header and quotes to make it a plain text file and ran `ror-lookup.py`:

```
$ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
$ csvgrep -c 2 -m true 2020-07-08-affiliations-ror.csv | wc -l
1378
$ csvgrep -c 2 -m false 2020-07-08-affiliations-ror.csv | wc -l
4490
$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
1406
$ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
4462
```
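
The header-and-quote stripping step is not shown in the notes. A minimal sketch of one way to do it with Python's `csv` module follows. The function name and the sample rows are hypothetical, and it assumes only the first column (the affiliation name) is kept:

```python
import csv
import io


def csv_first_column_to_lines(csv_text):
    """Return the first column of a CSV export as plain lines, minus the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row written by "WITH CSV HEADER"
    # csv.reader strips the surrounding quotes and un-escapes doubled quotes
    return [row[0] for row in reader if row]


# Hypothetical sample in the shape of the \COPY export (name, count)
sample = (
    "cg.contributor.affiliation,count\n"
    "International Livestock Research Institute,3\n"
    '"University of Nairobi, Kenya",2\n'
)

lines = csv_first_column_to_lines(sample)
print("\n".join(lines))
```

Using a CSV parser rather than `sed` or `tr` keeps values that contain commas or embedded quotes intact.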

- So, minus the CSV header, we have 1405 case-insensitive matches out of 5866 (23.9%)

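The match rate comes from the `csvgrep ... | wc -l` count minus one, because `csvgrep` also prints the header row. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def match_rate(matched_lines_with_header, total_names):
    """Return (matches, percentage) given a `csvgrep | wc -l` count and the input size."""
    matches = matched_lines_with_header - 1  # drop the CSV header line
    return matches, 100 * matches / total_names


# Counts taken from the commands above
matches, pct = match_rate(1406, 5866)
print(f"{matches} out of 5866 ({pct:.2f}%)")
```
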
## 2020-07-09

- Atmire responded to the ticket about DSpace 6 and Solr yesterday
- They said that the CUA issue is due to the "unmigrated" Solr records and that we should delete them
- I told them that [the "unmigrated" IDs are a known issue in DSpace 6](https://wiki.lyrasis.org/display/DSDOC6x/SOLR+Statistics+Maintenance) and we should rather figure out why they are unmigrated
- I didn't see any discussion on the dspace-tech mailing list or on DSpace Jira about unmigrated IDs, so I sent a mail to the mailing list to ask
- I updated `ror-lookup.py` to check aliases and acronyms as well and now the results are better for CGSpace's affiliation list:

```
$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
1516
$ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
4352
```

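The alias and acronym check could be sketched roughly as below. This is not the actual `ror-lookup.py` code. It assumes records shaped like the ROR v1 data dump, with `name`, `aliases` and `acronyms` keys, and the sample records and function names are made up for illustration:

```python
def build_ror_index(records):
    """Collect case-folded names, aliases and acronyms from ROR-style records."""
    index = set()
    for record in records:
        names = [record.get("name", "")]
        names += record.get("aliases", [])
        names += record.get("acronyms", [])
        index.update(name.casefold() for name in names if name)
    return index


def is_match(affiliation, index):
    """Case-insensitive exact match of an affiliation against the index."""
    return affiliation.strip().casefold() in index


# Hypothetical records; the alias and acronym values are illustrative only
sample_records = [
    {"name": "International Livestock Research Institute", "aliases": [], "acronyms": ["ILRI"]},
    {"name": "Wageningen University & Research", "aliases": ["Wageningen UR"], "acronyms": ["WUR"]},
]

index = build_ror_index(sample_records)
for affiliation in ["ilri", "Wageningen ur", "Unknown Institute"]:
    print(affiliation, is_match(affiliation, index))
```

Building the set once keeps each lookup cheap, which matters when checking several thousand affiliations against the full ROR dump.
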
- So now our matching improves to 1515 out of 5866 (25.8%)
- Gabriela from CIP said that I should run the author corrections, minus those that remove accent characters, so I will run them on CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
```

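The notes do not say how the accent-removing corrections were filtered out. One possible approach, sketched here with hypothetical function names and made-up author names, is to flag a correction when it equals the original with its diacritics stripped:

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def only_removes_accents(original, correction):
    """True if the correction differs from the original only by dropped accents."""
    return original != correction and strip_accents(original) == correction


# Hypothetical (original, correction) pairs
pairs = [
    ("Pérez, José", "Perez, Jose"),  # only strips accents: skip this one
    ("Perez, J", "Pérez, J."),       # adds an accent and a period: keep
]
to_apply = [pair for pair in pairs if not only_removes_accents(*pair)]
print(to_apply)
```
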
- Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:

```
$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
```

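To illustrate what a CSV-driven fix and delete pass amounts to, here is a sketch that uses an in-memory SQLite table as a stand-in for DSpace's PostgreSQL `metadatavalue` table. The helper names and sample values are hypothetical, the table and column names are taken from the `\COPY` query earlier in these notes, and the real scripts may work differently:

```python
import csv
import io
import sqlite3


def apply_fixes(conn, rows, field_id, from_col, to_col):
    """Replace text_value with the correction for each CSV row; return rows changed."""
    changed = 0
    for row in rows:
        cur = conn.execute(
            "UPDATE metadatavalue SET text_value = ? "
            "WHERE resource_type_id = 2 AND metadata_field_id = ? AND text_value = ?",
            (row[to_col], field_id, row[from_col]),
        )
        changed += cur.rowcount
    return changed


def apply_deletions(conn, rows, field_id, col):
    """Delete metadata values listed in the CSV rows; return rows deleted."""
    deleted = 0
    for row in rows:
        cur = conn.execute(
            "DELETE FROM metadatavalue "
            "WHERE resource_type_id = 2 AND metadata_field_id = ? AND text_value = ?",
            (field_id, row[col]),
        )
        deleted += cur.rowcount
    return deleted


# Stand-in table with made-up values (field 29 = sponsorship, as in -m 29 above)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue (resource_type_id INTEGER, metadata_field_id INTEGER, text_value TEXT)"
)
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?, ?)",
    [(2, 29, "usaid"), (2, 29, "usaid"), (2, 29, "N/A"), (2, 3, "usaid")],
)

fixes_csv = "dc.description.sponsorship,correct/action\nusaid,United States Agency for International Development\n"
deletes_csv = "dc.description.sponsorship\nN/A\n"

fixed = apply_fixes(
    conn, csv.DictReader(io.StringIO(fixes_csv)), 29, "dc.description.sponsorship", "correct/action"
)
deleted = apply_deletions(conn, csv.DictReader(io.StringIO(deletes_csv)), 29, "dc.description.sponsorship")
conn.commit()
print(fixed, deleted)
```

Parameterized queries avoid quoting problems with sponsor names that contain apostrophes, and restricting on `metadata_field_id` keeps the same string in other fields untouched.
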
- Start a full Discovery re-index on CGSpace:

```
$ time chrt -b 0 dspace index-discovery -b

real 94m21.413s
user 9m40.364s
sys 2m37.246s
```

- I modified `crossref-funders-lookup.py` to be case insensitive and now CGSpace's sponsors match 173 out of 534 (32.4%):

```
$ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
$ wc -l 2020-07-09-cgspace-sponsors.txt
534 2020-07-09-cgspace-sponsors.txt
$ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
174
```

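The case-insensitive comparison could be as simple as the following sketch. The function name is hypothetical, and the candidate names stand in for whatever funder names the Crossref lookup returns for a sponsor; no network request is made here:

```python
def find_funder_match(sponsor, candidate_names):
    """Return the first candidate equal to the sponsor ignoring case, else None."""
    wanted = sponsor.strip().casefold()
    for name in candidate_names:
        if name.strip().casefold() == wanted:
            return name
    return None


# Hypothetical candidates for one sponsor string
candidates = ["Bill & Melinda Gates Foundation", "Gates Foundation"]
print(find_funder_match("bill & melinda gates foundation", candidates))
print(find_funder_match("Some Unlisted Funder", candidates))
```

`str.casefold()` is used instead of `str.lower()` because it also handles non-ASCII case pairs, which appear in some sponsor names.
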
<!-- vim: set sw=2 ts=2: -->