mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-29
@ -324,4 +324,46 @@ COPY 77
- I did some metadata enrichment by searching for the items and copying relevant data from journal pages
- I asked Bosede to try to do the same for the rest of the journal articles

## 2020-01-29

- Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or using an old format:

```
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.youtube.com', 'https://www.youtube.com') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.youtube.com%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.slideshare.net', 'https://www.slideshare.net') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.slideshare.net%';
```
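
The prefix rewrites above can also be sketched outside the database, for example to preview what a value would become before running the `UPDATE` statements. This is only an illustrative sketch: the `PREFIX_REWRITES` table and the `normalize_url` helper are hypothetical names, not part of CGSpace or DSpace, and the sketch assumes that matching on a literal string prefix is equivalent to the `LIKE '...%'` filters used in the SQL.

```python
# Hypothetical rewrite table mirroring the six UPDATE statements:
# (old prefix, new prefix). The old prefixes are mutually exclusive,
# so the order of the entries does not matter.
PREFIX_REWRITES = [
    ("http://www.doi.org", "https://doi.org"),
    ("http://doi.org", "https://doi.org"),
    ("http://dx.doi.org", "https://doi.org"),
    ("https://dx.doi.org", "https://doi.org"),
    ("http://www.youtube.com", "https://www.youtube.com"),
    ("http://www.slideshare.net", "https://www.slideshare.net"),
]


def normalize_url(value: str) -> str:
    """Return value with a known old prefix swapped for its new form.

    Values that do not start with any known prefix are returned unchanged.
    """
    for old, new in PREFIX_REWRITES:
        if value.startswith(old):
            # Replace only the leading prefix; keep the rest of the URL as-is
            return new + value[len(old):]
    return value


if __name__ == "__main__":
    for sample in (
        "http://dx.doi.org/10.1000/182",
        "http://www.youtube.com/watch?v=abc",
        "https://doi.org/10.1000/182",
    ):
        print(sample, "->", normalize_url(sample))
```

Matching on a literal prefix sidesteps the fact that `.` is a wildcard in the regular expressions passed to `regexp_replace`; in the SQL above that is harmless because the `LIKE` filter already restricts each statement to values starting with the literal prefix.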

- I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:

```
dspace=# \COPY (SELECT resource_id as "id", text_value as "dc.identifier.issn" FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
COPY 23339
```
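
Some of the exported values can be screened mechanically before correcting them by hand. The sketch below is not something these notes used; it assumes the standard ISSN form (`NNNN-NNNC`, where `C` is a check digit computed modulo 11 with weights 8 down to 2, and `X` stands for 10). The function name `is_valid_issn` is hypothetical.

```python
import re

# Standard printed ISSN form: four digits, a hyphen, three digits,
# and a check character (a digit or an uppercase X).
ISSN_PATTERN = re.compile(r"[0-9]{4}-[0-9]{3}[0-9X]")


def is_valid_issn(value: str) -> bool:
    """Return True if value has the ISSN shape and a correct check digit."""
    if not ISSN_PATTERN.fullmatch(value):
        return False
    digits = value.replace("-", "")
    # Weight the first seven digits 8, 7, ..., 2 and sum the products
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


if __name__ == "__main__":
    for candidate in ("0378-5955", "0378-5954", "978-3-16-148410-0"):
        print(candidate, is_valid_issn(candidate))
```

Anything that fails this check (ISBNs, values containing several ISSNs, stray text) would still need a manual look; the check only narrows down where to start.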

- Then, after spending two hours correcting 1,000 ISSNs, I realized that I needed to normalize the `text_lang` fields in the database first, or else these would all look like changes because of the differing `text_lang` values ("en_US", NULL, etc.), for both ISSN and ISBN:

```
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
UPDATE 30454
```

- Then I realized that my initial PostgreSQL query wasn't so genius: if a field already has multiple values, each one appears on a separate line with the same ID, so when `dspace metadata-import` sees them, the change will be removed and added, or added and removed, depending on the order in which the lines are seen!
- A better course of action is to select the distinct ones and then correct them using `fix-metadata-values.py`...

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.identifier.issn[en_US]", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
COPY 2900
```
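
To illustrate the multi-value problem described above: an export with one row per value repeats the same item ID on several lines. These notes avoided the issue by working with distinct values instead, but for comparison, a minimal sketch of collapsing such rows into one line per item might look like the following. The function name `collapse_multi_values` is hypothetical, and the sketch assumes `||` is the multi-value separator expected in the metadata import CSV.

```python
from collections import defaultdict

# Assumption: "||" is the multi-value separator used in the import CSV.
SEPARATOR = "||"


def collapse_multi_values(rows):
    """Collapse (item_id, value) rows into one joined value per item ID.

    Values keep the order in which they were first seen, and exact
    duplicates within the same item are dropped.
    """
    grouped = defaultdict(list)
    for item_id, value in rows:
        if value not in grouped[item_id]:
            grouped[item_id].append(value)
    return {item_id: SEPARATOR.join(values) for item_id, values in grouped.items()}


if __name__ == "__main__":
    rows = [
        (1, "0378-5955"),
        (2, "2434-561X"),
        (1, "1234-5679"),  # second value for item 1, on its own row
    ]
    for item_id, joined in collapse_multi_values(rows).items():
        print(item_id, joined)
```

With one row per item, the importer sees each item's full set of values at once rather than a sequence of conflicting single-value rows.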

- I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later
- Then I applied 181 fixes for ISSNs using `fix-metadata-values.py` on DSpace Test and CGSpace (after testing locally):

```
$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
```

<!-- vim: set sw=2 ts=2: -->