- I sent these 300 items to Gaia...
## 2022-02-16
- Upgrade PostgreSQL on DSpace Test from version 10 to 12
- First, I installed the new version of PostgreSQL via the Ansible playbook scripts
- Then I stopped Tomcat and all PostgreSQL clusters and used `pg_upgradecluster` to upgrade the old cluster:
```console
# systemctl stop tomcat7
# pg_ctlcluster 10 main stop
# tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
# tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
# pg_ctlcluster 12 main stop
# pg_dropcluster 12 main
# pg_upgradecluster 10 main
# pg_ctlcluster 12 main start
```
- After that I [reindexed all the tables in the database using a generated query](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/):
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- I saw that the index on `metadatavalue` shrank by about 200MB!
- After testing a few things I dropped the old cluster:
```console
# pg_dropcluster 10 main
# dpkg -l | grep postgresql-10 | awk '{print $2}' | xargs dpkg -r
```
## 2022-02-17
- I updated my `migrate-fields.sh` script to use field names instead of IDs
- The script now looks up the appropriate `metadata_field_id` values for each field in the metadata registry
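- As a rough sketch (not the actual script), the field ID lookup can be done against DSpace's metadata registry tables, using `cg.identifier.doi` as an example:
```console
dspace=# SELECT f.metadata_field_id FROM metadatafieldregistry f JOIN metadataschemaregistry s ON f.metadata_schema_id = s.metadata_schema_id WHERE s.short_id = 'cg' AND f.element = 'identifier' AND f.qualifier = 'doi';
```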
## 2022-02-18
- Normalize the `text_lang` attributes of metadata on CGSpace:
```console
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2838588
 en        |    1082
           |     801
 fr        |       2
 vn        |       2
 en_US.    |       1
 sp        |       1
           |       0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', 'en_US.', '');
UPDATE 1884
dspace=# UPDATE metadatavalue SET text_lang='vi' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('vn');
UPDATE 2
dspace=# UPDATE metadatavalue SET text_lang='es' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('sp');
UPDATE 1
```
- I then exported the entire repository and did some cleanup on DOIs
- I found ~1,200 items with no `cg.identifier.doi`, but which had a DOI in their citation
- I cleaned up and normalized a few hundred others to use the https://doi.org format
- I'm debating using the Crossref API to search for our DOIs and improve our metadata
- For example: https://api.crossref.org/works/10.1016/j.ecolecon.2008.03.011
- There is good data on publishers, issue dates, volume/issue, and sometimes even licenses
- I cleaned up ~1,200 URLs that were using HTTP instead of HTTPS, fixed a bunch of handles, removed some handles from the DOI field, etc.
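- Querying the example DOI above with `curl` and `jq` shows the shape of the data (a minimal sketch; the exact fields present vary by record):
```console
$ curl -s 'https://api.crossref.org/works/10.1016/j.ecolecon.2008.03.011' | jq '.message | {publisher, volume, issue, issued: .issued."date-parts", license: [.license[]?.URL]}'
```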
## 2022-02-20
- Yesterday I wrote a script to check our DOIs against Crossref's API and then did some investigation on dates, volumes, issues, pages, and types
- While investigating issue dates in OpenRefine I created a new column using this GREL to show the number of days between Crossref's date and ours:
```console
abs(diff(toDate(cells["issued"].value),toDate(cells["dcterms.issued[en_US]"].value), "days"))
```
- In *most* cases Crossref's dates are more correct than ours, though there are a few odd cases where I haven't decided what strategy to use yet
- Start a full harvest on AReS
## 2022-02-21
- I added support for checking the licenses of DOIs to my Crossref script (see the sketch at the end of this entry)
- I exported ~2,800 DOIs and ran a check on them, then merged the CGSpace CSV with the results of the script to inspect in OpenRefine
- There are hundreds of DOIs missing licenses in our data, even in this small subset of ~2,800 (out of 19,000 on CGSpace)
- I spot-checked a few dozen in Crossref's data and found some incorrect ones, for example on Elsevier, Wiley, and Sage journals
- I ended up using a series of GREL expressions in OpenRefine to filter out DOIs from these prefixes:
```console
or(
value.contains("10.1017"),
value.contains("10.1007"),
value.contains("10.1016"),
value.contains("10.1098"),
value.contains("10.1111"),
value.contains("10.1002"),
value.contains("10.1046"),
value.contains("10.2135"),
value.contains("10.1006"),
value.contains("10.1177"),
value.contains("10.1079"),
value.contains("10.2298"),
value.contains("10.1186"),
value.contains("10.3835"),
value.contains("10.1128"),
value.contains("10.3732"),
value.contains("10.2134")
)
```
- Many of Crossref's records are correct where we have no license, and in some cases they are more correct where we have a different license
- I ran license updates on ~167 DOIs in the end on CGSpace
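- As a rough sketch of that license lookup (not my actual script, and assuming one DOI per line in a hypothetical file like `/tmp/dois.txt`):
```console
$ while read -r doi; do echo -n "$doi: "; curl -s "https://api.crossref.org/works/$doi" | jq -r '[.message.license[]?.URL] | join(", ")'; done < /tmp/dois.txt
```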
<!-- vim: set sw=2 ts=2: -->