mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2020-07-05
@ -232,5 +232,33 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
- Mohammed Salem modified my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) to query Solr directly so I started writing a script to benchmark it today
  - I will monitor the JVM memory and CPU usage in visualvm, just like I did in 2019-04
  - I noticed an issue with his limit parameter so I sent him some feedback on that in the meantime
- I noticed that we have 20,000 distinct values for `dc.subject`, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:

```
dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
```

- DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:

```
dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
```

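As a rough illustration of what the `text_value ~ '[[:lower:]]'` condition in the UPDATE statements above selects, here is a minimal Python sketch. The function name `needs_uppercasing` and the sample subjects are made up for illustration, and Python's `str.islower()` is only an approximation of PostgreSQL's locale-dependent character class:

```python
def needs_uppercasing(value: str) -> bool:
    """Return True if the value contains at least one lowercase letter.

    Approximates the SQL condition text_value ~ '[[:lower:]]', which matches
    any string containing a lowercase character anywhere (it is unanchored).
    """
    return any(ch.islower() for ch in value)


# Hypothetical sample values, not taken from the real database
subjects = ["MAIZE", "Food Security", "gender", "COVID-19"]

# Only values with a lowercase letter are touched; already-uppercase values
# (including those containing digits or punctuation) are left alone.
to_fix = [s for s in subjects if needs_uppercasing(s)]
fixed = [s.upper() for s in to_fix]

print(to_fix)
print(fixed)
```

Restricting the update to rows that contain a lowercase letter means already-uppercase values are never rewritten, which keeps the number of modified rows small.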
- Note the use of the POSIX character class :)
- I suggest that we generate a list of the top 5,000 values that don't match AGROVOC so that Sisay can correct them
  - Start by getting the top 6,500 subjects (assuming that the top ~1,500 are valid from our previous work):

```
dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-cgspace-subjects.txt
```

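For reference, the `csvcut -c1 ... | head -n 6500` step above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper name `top_values` and the sample rows are hypothetical, and the sketch relies on the export having no header row and already being sorted by count, as in the `\COPY` query:

```python
import csv
import io
from itertools import islice


def top_values(csv_file, limit=6500):
    """Return the first column of the first `limit` rows of a CSV file.

    Assumes the rows are already sorted by frequency (descending) and that
    there is no header row.
    """
    reader = csv.reader(csv_file)
    return [row[0] for row in islice(reader, limit) if row]


# Hypothetical sample in the same "value,count" layout as the export
sample = io.StringIO('MAIZE,120\n"FOOD SECURITY",95\nGENDER,80\n')
print(top_values(sample, limit=2))
```

Using the `csv` module rather than splitting on commas keeps quoted values that contain commas intact, which is the same reason for using `csvcut` instead of `cut`.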
  - Then start looking them up using `agrovoc-lookup.py`:

```
$ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
```

<!-- vim: set sw=2 ts=2: -->