mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-06-24
@ -145,4 +145,55 @@ $ grep -c 'Adding ORCID' /tmp/orcids2.log
- Meeting with Salem to discuss metadata between CGSpace and MEL
- We started working through his spreadsheet and then the Internet dropped

## 2022-06-23

- Start looking at country names between MEL, CGSpace, and standards like UN M.49 and GeoNames
- I used `xmllint` to extract the countries from CGSpace's input forms:

```console
$ xmllint --xpath '//value-pairs[@value-pairs-name="countrylist"]/pair/stored-value/node()' dspace/config/input-forms.xml > /tmp/cgspace-countries.txt
```

- Then I wrote a Python script (`countries-to-csv.py`) to read them and save their names alongside the ISO 3166-1 alpha-2 code
- Then I joined them with the other lists:

```console
$ csvjoin --outer -c alpha2 ~/Downloads/clarisa-countries.csv ~/Downloads/UNSD\ —\ Methodology.csv ~/Downloads/geonames-countries.csv /tmp/cgspace-countries.csv /tmp/mel-countries.csv > /tmp/countries.csv
```

- This mostly worked fine, and is much easier than writing another Python script with Pandas...

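- A minimal sketch of the kind of thing `countries-to-csv.py` might do, not the actual script: it reads one country name per line and writes the name next to its alpha-2 code. The `ALPHA2_BY_NAME` table, the `countries_to_csv` function, and the `cgspace_name` column header are all hypothetical; a real script would need a complete source of ISO 3166-1 names and codes:

```python
import csv
import io

# Hypothetical lookup table, for illustration only. A real script would need
# a complete source of ISO 3166-1 country names and alpha-2 codes.
ALPHA2_BY_NAME = {
    "Kenya": "KE",
    "Peru": "PE",
    "Viet Nam": "VN",
}


def countries_to_csv(names, out, lookup=ALPHA2_BY_NAME, name_column="cgspace_name"):
    """Write one CSV row per country name with its alpha-2 code (blank if unknown)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["alpha2", name_column])
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # skip blank lines in the input list
        writer.writerow([lookup.get(name, ""), name])


if __name__ == "__main__":
    buffer = io.StringIO()
    # "Atlantis" stands in for a name that has no match in the lookup table.
    countries_to_csv(["Kenya", "Peru", "Atlantis"], buffer)
    print(buffer.getvalue(), end="")
```

- Names without a match get an empty `alpha2` value instead of being dropped, so they remain visible for manual review
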
## 2022-06-24

- Spent some more time working on my `countries-to-csv.py` script to fix some logic errors
- Then re-export the UN M.49 countries to a clean list because the one I did yesterday somehow has errors:

```console
$ csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ —\ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
```

- Check the number of lines in each file:

```console
$ wc -l clarisa-countries.csv un-countries.csv cgspace-countries.csv mel-countries.csv
250 clarisa-countries.csv
250 un-countries.csv
198 cgspace-countries.csv
258 mel-countries.csv
```

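- Line counts include the header row and do not show which codes differ between files. A minimal sketch, assuming each file has an `alpha2` column as in the commands above, that lists the codes found in only one of two files (the `alpha2_codes` and `compare_codes` helpers and the sample rows are made up for illustration):

```python
import csv
import io


def alpha2_codes(csv_file, column="alpha2"):
    """Return the set of non-empty values in `column` of an open CSV file."""
    codes = set()
    for row in csv.DictReader(csv_file):
        code = (row.get(column) or "").strip()
        if code:
            codes.add(code)
    return codes


def compare_codes(left_file, right_file, column="alpha2"):
    """Return (codes only in left, codes only in right), each sorted."""
    left = alpha2_codes(left_file, column)
    right = alpha2_codes(right_file, column)
    return sorted(left - right), sorted(right - left)


if __name__ == "__main__":
    # Made-up sample rows standing in for two of the country lists.
    first = io.StringIO("alpha2,cgspace_name\nKE,Kenya\nPE,Peru\n")
    second = io.StringIO("alpha2,mel_name\nKE,Kenya\nSS,South Sudan\n")
    only_first, only_second = compare_codes(first, second)
    print("only in first:", only_first)
    print("only in second:", only_second)
```
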
- I am seeing strange results with csvjoin's `--outer` join, which I need in order to keep unmatched terms from both the left and right files...
- Using `xsv join --full` is giving me better results:

```console
$ xsv join --full alpha2 ~/Downloads/clarisa-countries.csv alpha2 ~/Downloads/un-countries.csv | xsv select '!alpha2[1]' > /tmp/clarisa-un-xsv-full.csv
```

- Then adding the CGSpace and MEL countries:

```console
$ xsv join --full alpha2 /tmp/clarisa-un-xsv-full.csv alpha2 /tmp/cgspace-countries.csv | xsv select '!alpha2[1]' > /tmp/clarisa-un-cgspace-xsv-full.csv
$ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-countries.csv | xsv select '!alpha2[1]' > /tmp/clarisa-un-cgspace-mel-xsv-full.csv
```

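- For comparison, a minimal sketch of a full outer join on `alpha2` using only Python's standard library. This is not how `xsv` or `csvjoin` work internally; it assumes `alpha2` is unique within each file, and the `read_rows` and `full_outer_join` helpers and the sample rows are made up for illustration:

```python
import csv
import io
import sys


def read_rows(csv_file):
    """Return (fieldnames, rows) for an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return list(reader.fieldnames or []), list(reader)


def full_outer_join(left, right, key="alpha2"):
    """Full outer join of two open CSV files on `key`.

    Assumes `key` is unique within each file. Returns (fieldnames, rows).
    """
    left_fields, left_rows = read_rows(left)
    right_fields, right_rows = read_rows(right)
    # Keep the key column once, followed by the other columns of each side.
    extra_right = [f for f in right_fields if f != key and f not in left_fields]
    fields = left_fields + extra_right

    right_by_key = {row[key]: row for row in right_rows}
    joined, seen = [], set()
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = {f: "" for f in fields}
        merged.update({f: row.get(f) or "" for f in left_fields})
        merged.update({f: match.get(f) or "" for f in extra_right})
        joined.append(merged)
        seen.add(row[key])
    # Append right-hand rows that had no left-hand match.
    for row in right_rows:
        if row[key] not in seen:
            merged = {f: "" for f in fields}
            merged[key] = row[key]
            merged.update({f: row.get(f) or "" for f in extra_right})
            joined.append(merged)
    return fields, joined


if __name__ == "__main__":
    # Made-up sample rows standing in for two of the country lists.
    first = io.StringIO("alpha2,clarisa_name\nKE,Kenya\nPE,Peru\n")
    second = io.StringIO("alpha2,un_name\nKE,Kenya\nSS,South Sudan\n")
    fields, rows = full_outer_join(first, second)
    writer = csv.DictWriter(sys.stdout, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

- This sketch coalesces the key into a single `alpha2` column, so a row found only in the second file still carries its code; it is worth checking that the same holds in the output of any join tool after dropping a duplicate key column
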
<!-- vim: set sw=2 ts=2: -->