mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-26 00:18:21 +01:00
Add notes for 2020-10-28
This commit is contained in:
parent
5f76797488
commit
a0368b4e52
@ -760,7 +760,161 @@ $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H
|
||||
```
|
||||
|
||||
- Then I started processing the statistics-2017 core...
|
||||
- The processing finished with no errors and afterwards I purged 800,000 unmigrated records (all with `type: 5`):
|
||||
|
||||
```
|
||||
$ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
|
||||
```
|
||||
|
||||
- Also I purged 2.7 million unmigrated records from the statistics-2019 core
|
||||
- I filed an issue with Atmire about the duplicate values in the `owningComm` and `containerCommunity` fields in Solr: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
|
||||
- Add new ORCID identifier for [Perle LATRE DE LATE](https://orcid.org/0000-0003-3871-6277) to controlled vocabulary
|
||||
- Use `move-collections.sh` to move a few AgriFood Tools collections on CGSpace into a new [sub community](https://hdl.handle.net/10568/109982)
|
||||
|
||||
## 2020-10-27
|
||||
|
||||
- I purged 849,408 unmigrated records from the statistics-2016 core after it finished processing...
|
||||
- I purged 285,000 unmigrated records from the statistics-2015 core after it finished processing...
|
||||
- I purged 196,000 unmigrated records from the statistics-2014 core after it finished processing...
|
||||
- I finally finished processing all the statistics cores with the `solr-upgrade-statistics-6x` utility on DSpace Test
|
||||
- I started the Atmire stats processing:
|
||||
|
||||
```
|
||||
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
|
||||
```
|
||||
|
||||
- Peter asked me to add the new preferred AGROVOC subject "covid-19" to all items to which we had previously added "coronavirus disease", and to make sure all items with ILRI subject "ZOONOTIC DISEASES" have the AGROVOC subject "zoonoses"
|
||||
- I exported all the records on CGSpace from the CLI and extracted the columns I needed to process them in OpenRefine:
|
||||
|
||||
```
|
||||
$ dspace metadata-export -f /tmp/cgspace.csv
|
||||
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
|
||||
```
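The subject changes themselves were made in OpenRefine, but the two rules Peter asked for can be sketched in Python for illustration. This is a minimal sketch, not the procedure actually used: the helper names are hypothetical, the sample row is invented, and it assumes multiple values in one cell are joined with `||` as in DSpace's CSV export.

```python
SEPARATOR = "||"  # assumption: multi-value cells are joined with "||"


def split_values(cell):
    """Split a multi-value CSV cell into a list, dropping empty values."""
    return [value for value in cell.split(SEPARATOR) if value]


def add_subject_if(trigger_cell, trigger, target_cell, new_subject):
    """Append new_subject to target_cell when trigger occurs in trigger_cell.

    Comparison is case-insensitive and the subject is not added twice.
    """
    triggers = {value.lower() for value in split_values(trigger_cell)}
    targets = split_values(target_cell)
    present = {value.lower() for value in targets}
    if trigger.lower() in triggers and new_subject.lower() not in present:
        targets.append(new_subject)
    return SEPARATOR.join(targets)


# Invented sample row using two of the columns extracted with csvcut
row = {
    "dc.subject[en_US]": "coronavirus disease||livestock",
    "cg.subject.ilri[en_US]": "ZOONOTIC DISEASES",
}

subjects = row["dc.subject[en_US]"]
# Rule 1: items with "coronavirus disease" also get "covid-19"
subjects = add_subject_if(subjects, "coronavirus disease", subjects, "covid-19")
# Rule 2: items with ILRI subject "ZOONOTIC DISEASES" get AGROVOC "zoonoses"
subjects = add_subject_if(
    row["cg.subject.ilri[en_US]"], "ZOONOTIC DISEASES", subjects, "zoonoses"
)
print(subjects)  # coronavirus disease||livestock||covid-19||zoonoses
```

Keeping the trigger column and the target column as separate arguments lets the same helper cover both the same-column rule and the cross-column rule.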
|
||||
|
||||
- I sanity checked the CSV in `csv-metadata-quality` after exporting from OpenRefine, then applied the changes to 453 items on CGSpace
|
||||
- Skype with Peter and Abenet about CGSpace Explorer (AReS)
|
||||
- They want to do a big push for ILRI and our partners to use it in mid-November (around the 16th), so we need to clean up the metadata and try to fix the views/downloads issue by then
|
||||
- I filed [an issue](https://github.com/ilri/OpenRXV/issues/45) on OpenRXV for the views/downloads
|
||||
- We also talked about harvesting CIMMYT's repository into AReS, perhaps with only a subset of their data, though they seem to have some issues with their data:
|
||||
- dc.contributor.author and dcterms.creator
|
||||
- dc.title and dcterms.title
|
||||
- dc.region.focus
|
||||
- dc.coverage.countryfocus
|
||||
- dc.rights.accesslevel (access status)
|
||||
- dc.source.journal (source)
|
||||
- dcterms.type and dc.type
|
||||
- dc.subject.agrovoc
|
||||
- I did some work on my previous `create-mappings.py` script to process journal titles and sponsors/investors as well as CRPs and affiliations
|
||||
- I converted it to use the Elasticsearch scroll API directly rather than consuming a JSON file
|
||||
- The result is about 1200 mappings, mostly to remove acronyms at the end of metadata values
|
||||
- I added a few custom mappings using `convert-mapping.py` and then uploaded them to AReS:
|
||||
|
||||
```
|
||||
$ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
|
||||
$ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
|
||||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
|
||||
```
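For context, the file posted to `_bulk` above is newline-delimited JSON in which every document line is preceded by an action line. A rough sketch of how such a body could be assembled is below. It is not the code of `create-mappings.py`: the function name is hypothetical, the `replace` key is an assumption (only `find` appears in these notes), and the sample mapping is invented.

```python
import json


def build_bulk_body(mappings):
    """Build an Elasticsearch bulk body from (find, replace) pairs.

    Each document is preceded by an {"index":{}} action line, and the
    body ends with a newline, as the bulk API requires.
    """
    action = json.dumps({"index": {}}, separators=(",", ":"))
    lines = []
    for find, replace in mappings:
        lines.append(action)
        # "replace" is an assumed field name; only "find" is seen in the notes
        document = {"find": find, "replace": replace}
        lines.append(json.dumps(document, separators=(",", ":"), ensure_ascii=False))
    return "\n".join(lines) + "\n" if lines else ""


# Invented example: strip a trailing acronym from a metadata value
body = build_bulk_body(
    [("Example Research Institute (ERI)", "Example Research Institute")]
)
print(body, end="")
```

Writing the returned string to a file gives something that could be passed to `curl --data-binary @file` in the same way as above.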
|
||||
|
||||
- After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up
|
||||
- I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:
|
||||
|
||||
```
|
||||
$ docker-compose up --build -d angular_nginx
|
||||
```
|
||||
|
||||
## 2020-10-28
|
||||
|
||||
- Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:
|
||||
|
||||
```
|
||||
$ docker-compose up --build -d --force-recreate angular_nginx
|
||||
```
|
||||
|
||||
- Also, I realized that the mysterious issue with countries getting changed to inconsistent lower case like "Burkina faso" is due to the country formatter (see: `backend/src/harvester/consumers/fetch.consumer.ts`)
|
||||
- I don't understand TypeScript syntax, so for now I will just disable that formatter in each repository configuration, which should be an improvement since we're all using title case like "Kenya" and "Burkina Faso" now anyway
|
||||
- Also, I fixed a few mappings with WorldFish data
|
||||
- Peter really wants us to move forward with the alignment of our regions to UN M.49, and the CKM web team hasn't responded to any of the mails we've sent recently so I will just do it
|
||||
- These are the changes that will happen in the input forms:
|
||||
- East Africa → Eastern Africa
|
||||
- West Africa → Western Africa
|
||||
- Southeast Asia → South-eastern Asia
|
||||
- South Asia → Southern Asia
|
||||
- Africa South of Sahara → Sub-Saharan Africa
|
||||
- North Africa → Northern Africa
|
||||
- West Asia → Western Asia
|
||||
- There are some regions we use that are not present, for example Sahel, ACP, Middle East, and West and Central Africa. I will advocate for closer alignment later
|
||||
- I ran my `fix-metadata-values.py` script to update the values in the database:
|
||||
|
||||
```
|
||||
$ cat 2020-10-28-update-regions.csv
|
||||
cg.coverage.region,correct
|
||||
East Africa,Eastern Africa
|
||||
West Africa,Western Africa
|
||||
Southeast Asia,South-eastern Asia
|
||||
South Asia,Southern Asia
|
||||
Africa South Of Sahara,Sub-Saharan Africa
|
||||
North Africa,Northern Africa
|
||||
West Asia,Western Asia
|
||||
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
|
||||
```
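As an illustration of what the corrections file expresses, here is a small Python sketch that reads the same CSV and applies it to a list of values. It is not the implementation of `fix-metadata-values.py`, which updates the database directly; the function names are hypothetical and the sample values are invented.

```python
import csv
import io

# Same content as 2020-10-28-update-regions.csv shown above
CSV_TEXT = """cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
Southeast Asia,South-eastern Asia
South Asia,Southern Asia
Africa South Of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
"""


def load_corrections(csv_text, field, target):
    """Return a dict mapping each value in column `field` to column `target`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[field]: row[target] for row in reader}


def apply_corrections(values, corrections):
    """Replace values that have a correction and leave the others untouched."""
    return [corrections.get(value, value) for value in values]


corrections = load_corrections(CSV_TEXT, "cg.coverage.region", "correct")
print(apply_corrections(["East Africa", "Sahel", "West Asia"], corrections))
```

The lookup in this sketch is an exact string match, so a value only changes if its spelling and capitalization are identical to the first column of the CSV.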
|
||||
|
||||
- Then I started a full Discovery re-indexing:
|
||||
|
||||
```console
|
||||
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||||
|
||||
real 92m14.294s
|
||||
user 7m59.840s
|
||||
sys 2m22.327s
|
||||
```
|
||||
|
||||
- I realized I had been using an incorrect Solr query to purge unmigrated items after processing with `solr-upgrade-statistics-6x`...
|
||||
- Instead of this: `(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)`
|
||||
- I should have used this: `id:/.+-unmigrated/`
|
||||
- Or perhaps this (with a check first!): `*:* NOT id:/.{36}/`
|
||||
- We need to make sure to explicitly purge the unmigrated records, then purge any that are not matching the UUID pattern (after inspecting manually!)
|
||||
- There are still 3.7 million records in our ten years of Solr statistics that are unmigrated (I only noticed because the DSpace Statistics API indexer kept failing)
|
||||
- I don't think this is serious enough to re-start the simulation of the DSpace 6 migration over again, but I definitely want to make sure I use the correct query when I do CGSpace
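To see why the original query left the unmigrated records in place, the logic of the queries can be mimicked in Python. This is only a sketch: it relies on Solr regular expression queries matching the whole term (hence `re.fullmatch`), and the sample ids and function names are invented for illustration.

```python
import re

# Solr/Lucene regex queries match the entire term, so use fullmatch
UUID_LENGTH = re.compile(r".{36}")
UNMIGRATED = re.compile(r".+-unmigrated")


def old_query_matches(doc_id):
    """(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)"""
    return not UUID_LENGTH.fullmatch(doc_id) and not UNMIGRATED.fullmatch(doc_id)


def unmigrated_query_matches(doc_id):
    """id:/.+-unmigrated/"""
    return bool(UNMIGRATED.fullmatch(doc_id))


def classify(doc_id):
    """Label an id as 'unmigrated', 'migrated' (36 characters), or 'other'."""
    if UNMIGRATED.fullmatch(doc_id):
        return "unmigrated"
    if UUID_LENGTH.fullmatch(doc_id):
        return "migrated"
    return "other"


# Invented sample ids
sample_ids = [
    "a0b1c2d3-e4f5-4a6b-8c7d-9e0f1a2b3c4d",  # 36 characters, UUID-shaped
    "12345-unmigrated",
    "12345",
]

for doc_id in sample_ids:
    print(doc_id, classify(doc_id), old_query_matches(doc_id))
```

In this sketch the old query is false for `12345-unmigrated`, so a delete based on it skips exactly the records that were meant to be purged and only removes ids that are neither 36 characters long nor suffixed with `-unmigrated`.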
|
||||
- The AReS indexing finished after I removed the country formatting from all the repository configurations and now I see values like "SA", "CA", etc...
|
||||
- So really we need this to fix MELSpace countries, so I will re-enable the country formatting for their repository
|
||||
- Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:
|
||||
|
||||
```
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
|
||||
COPY 6357
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
|
||||
COPY 730
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
|
||||
COPY 71748
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.publisher", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
|
||||
COPY 3882
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.source", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
|
||||
COPY 3684
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.relation.ispartofseries", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
|
||||
COPY 5598
|
||||
```
|
||||
|
||||
- I noticed there are still some mappings for acronyms and other fixes that haven't been applied, so I ran my `create-mappings.py` script against Elasticsearch again
|
||||
- Now I'm comparing yesterday's mappings with today's and I don't see any duplicates...
|
||||
|
||||
```
|
||||
$ grep -c '"find"' /tmp/elasticsearch-mappings*
|
||||
/tmp/elasticsearch-mappings2.txt:350
|
||||
/tmp/elasticsearch-mappings.txt:1228
|
||||
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
|
||||
1578
|
||||
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | uniq | wc -l
|
||||
1578
|
||||
```
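The same duplicate check can be written in Python, which also reports which mappings are repeated rather than only comparing counts. This is a sketch under the assumption that each mapping is a single JSON line and that action lines look like `{"index":{}}`; the function names and sample data are invented, and `replace` is an assumed key.

```python
from collections import Counter

ACTION_LINE = '{"index":{}}'


def mapping_lines(bulk_text):
    """Return the document lines of a bulk file, skipping action and blank lines."""
    return [
        line
        for line in bulk_text.splitlines()
        if line.strip() and ACTION_LINE not in line
    ]


def find_duplicates(*bulk_texts):
    """Return the sorted document lines that occur more than once overall."""
    counts = Counter(
        line for text in bulk_texts for line in mapping_lines(text)
    )
    return sorted(line for line, count in counts.items() if count > 1)


# Invented sample data standing in for the two mapping files
first = '{"index":{}}\n{"find":"x (X)","replace":"x"}\n'
second = (
    '{"index":{}}\n{"find":"x (X)","replace":"x"}\n'
    '{"index":{}}\n{"find":"p (P)","replace":"p"}\n'
)

print(find_duplicates(first, second))
```

An empty list corresponds to the result above, where the line count is the same before and after `sort | uniq`.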
|
||||
|
||||
- I have no idea why they wouldn't have been caught yesterday when I originally ran the script on a clean AReS with no mappings...
|
||||
- In any case, I combined the mappings and then uploaded them to AReS:
|
||||
|
||||
```console
|
||||
$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
|
||||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
|
||||
```
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
|
@ -23,7 +23,7 @@ During the FlywayDB migration I got an error:
|
||||
<meta property="og:type" content="article" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-10/" />
|
||||
<meta property="article:published_time" content="2020-10-06T16:55:54+03:00" />
|
||||
<meta property="article:modified_time" content="2020-10-24T22:23:06+03:00" />
|
||||
<meta property="article:modified_time" content="2020-10-26T16:34:45+03:00" />
|
||||
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="October, 2020"/>
|
||||
@ -51,9 +51,9 @@ During the FlywayDB migration I got an error:
|
||||
"@type": "BlogPosting",
|
||||
"headline": "October, 2020",
|
||||
"url": "https://alanorth.github.io/cgspace-notes/2020-10/",
|
||||
"wordCount": "5014",
|
||||
"wordCount": "6262",
|
||||
"datePublished": "2020-10-06T16:55:54+03:00",
|
||||
"dateModified": "2020-10-24T22:23:06+03:00",
|
||||
"dateModified": "2020-10-26T16:34:45+03:00",
|
||||
"author": {
|
||||
"@type": "Person",
|
||||
"name": "Alan Orth"
|
||||
@ -975,11 +975,180 @@ java.lang.OutOfMemoryError: Java heap space
|
||||
</ul>
|
||||
<pre><code>$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
|
||||
</code></pre><ul>
|
||||
<li>Then I started processing the statistics-2017 core…</li>
|
||||
<li>Then I started processing the statistics-2017 core…
|
||||
<ul>
|
||||
<li>The processing finished with no errors and afterwards I purged 800,000 unmigrated records (all with <code>type: 5</code>):</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
|
||||
</code></pre><ul>
|
||||
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
|
||||
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
|
||||
<li>Add new ORCID identifier for <a href="https://orcid.org/0000-0003-3871-6277">Perle LATRE DE LATE</a> to controlled vocabulary</li>
|
||||
<li>Use <code>move-collections.sh</code> to move a few AgriFood Tools collections on CGSpace into a new <a href="https://hdl.handle.net/10568/109982">sub community</a></li>
|
||||
</ul>
|
||||
<!-- raw HTML omitted -->
|
||||
<h2 id="2020-10-27">2020-10-27</h2>
|
||||
<ul>
|
||||
<li>I purged 849,408 unmigrated records from the statistics-2016 core after it finished processing…</li>
|
||||
<li>I purged 285,000 unmigrated records from the statistics-2015 core after it finished processing…</li>
|
||||
<li>I purged 196,000 unmigrated records from the statistics-2014 core after it finished processing…</li>
|
||||
<li>I finally finished processing all the statistics cores with the <code>solr-upgrade-statistics-6x</code> utility on DSpace Test
|
||||
<ul>
|
||||
<li>I started the Atmire stats processing:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
|
||||
</code></pre><ul>
|
||||
<li>Peter asked me to add the new preferred AGROVOC subject “covid-19” to all items to which we had previously added “coronavirus disease”, and to make sure all items with ILRI subject “ZOONOTIC DISEASES” have the AGROVOC subject “zoonoses”
|
||||
<ul>
|
||||
<li>I exported all the records on CGSpace from the CLI and extracted the columns I needed to process them in OpenRefine:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ dspace metadata-export -f /tmp/cgspace.csv
|
||||
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
|
||||
</code></pre><ul>
|
||||
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
|
||||
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
|
||||
<ul>
|
||||
<li>They want to do a big push for ILRI and our partners to use it in mid-November (around the 16th), so we need to clean up the metadata and try to fix the views/downloads issue by then</li>
|
||||
<li>I filed <a href="https://github.com/ilri/OpenRXV/issues/45">an issue</a> on OpenRXV for the views/downloads</li>
|
||||
<li>We also talked about harvesting CIMMYT’s repository into AReS, perhaps with only a subset of their data, though they seem to have some issues with their data:
|
||||
<ul>
|
||||
<li>dc.contributor.author and dcterms.creator</li>
|
||||
<li>dc.title and dcterms.title</li>
|
||||
<li>dc.region.focus</li>
|
||||
<li>dc.coverage.countryfocus</li>
|
||||
<li>dc.rights.accesslevel (access status)</li>
|
||||
<li>dc.source.journal (source)</li>
|
||||
<li>dcterms.type and dc.type</li>
|
||||
<li>dc.subject.agrovoc</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I did some work on my previous <code>create-mappings.py</code> script to process journal titles and sponsors/investors as well as CRPs and affiliations
|
||||
<ul>
|
||||
<li>I converted it to use the Elasticsearch scroll API directly rather than consuming a JSON file</li>
|
||||
<li>The result is about 1200 mappings, mostly to remove acronyms at the end of metadata values</li>
|
||||
<li>I added a few custom mappings using <code>convert-mapping.py</code> and then uploaded them to AReS:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
|
||||
$ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
|
||||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
|
||||
</code></pre><ul>
|
||||
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
|
||||
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
|
||||
</ul>
|
||||
<pre><code>$ docker-compose up --build -d angular_nginx
|
||||
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
|
||||
<ul>
|
||||
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
|
||||
</ul>
|
||||
<pre><code>$ docker-compose up --build -d --force-recreate angular_nginx
|
||||
</code></pre><ul>
|
||||
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent lower case like “Burkina faso” is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>)
|
||||
<ul>
|
||||
<li>I don’t understand TypeScript syntax, so for now I will just disable that formatter in each repository configuration, which should be an improvement since we’re all using title case like “Kenya” and “Burkina Faso” now anyway</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Also, I fixed a few mappings with WorldFish data</li>
|
||||
<li>Peter really wants us to move forward with the alignment of our regions to UN M.49, and the CKM web team hasn’t responded to any of the mails we’ve sent recently so I will just do it
|
||||
<ul>
|
||||
<li>These are the changes that will happen in the input forms:
|
||||
<ul>
|
||||
<li>East Africa → Eastern Africa</li>
|
||||
<li>West Africa → Western Africa</li>
|
||||
<li>Southeast Asia → South-eastern Asia</li>
|
||||
<li>South Asia → Southern Asia</li>
|
||||
<li>Africa South of Sahara → Sub-Saharan Africa</li>
|
||||
<li>North Africa → Northern Africa</li>
|
||||
<li>West Asia → Western Asia</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>There are some regions we use that are not present, for example Sahel, ACP, Middle East, and West and Central Africa. I will advocate for closer alignment later</li>
|
||||
<li>I ran my <code>fix-metadata-values.py</code> script to update the values in the database:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ cat 2020-10-28-update-regions.csv
|
||||
cg.coverage.region,correct
|
||||
East Africa,Eastern Africa
|
||||
West Africa,Western Africa
|
||||
Southeast Asia,South-eastern Asia
|
||||
South Asia,Southern Asia
|
||||
Africa South Of Sahara,Sub-Saharan Africa
|
||||
North Africa,Northern Africa
|
||||
West Asia,Western Asia
|
||||
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
|
||||
</code></pre><ul>
|
||||
<li>Then I started a full Discovery re-indexing:</li>
|
||||
</ul>
|
||||
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||||
|
||||
real 92m14.294s
|
||||
user 7m59.840s
|
||||
sys 2m22.327s
|
||||
</code></pre><ul>
|
||||
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>…
|
||||
<ul>
|
||||
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
|
||||
<li>I should have used this: <code>id:/.+-unmigrated/</code></li>
|
||||
<li>Or perhaps this (with a check first!): <code>*:* NOT id:/.{36}/</code></li>
|
||||
<li>We need to make sure to explicitly purge the unmigrated records, then purge any that are not matching the UUID pattern (after inspecting manually!)</li>
|
||||
<li>There are still 3.7 million records in our ten years of Solr statistics that are unmigrated (I only noticed because the DSpace Statistics API indexer kept failing)</li>
|
||||
<li>I don’t think this is serious enough to re-start the simulation of the DSpace 6 migration over again, but I definitely want to make sure I use the correct query when I do CGSpace</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>The AReS indexing finished after I removed the country formatting from all the repository configurations and now I see values like “SA”, “CA”, etc…
|
||||
<ul>
|
||||
<li>So really we need this to fix MELSpace countries, so I will re-enable the country formatting for their repository</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
|
||||
</ul>
|
||||
<pre><code>dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
|
||||
COPY 6357
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
|
||||
COPY 730
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
|
||||
COPY 71748
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.publisher", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
|
||||
COPY 3882
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.source", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
|
||||
COPY 3684
|
||||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.relation.ispartofseries", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
|
||||
COPY 5598
|
||||
</code></pre><ul>
|
||||
<li>I noticed there are still some mappings for acronyms and other fixes that haven’t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
|
||||
<ul>
|
||||
<li>Now I’m comparing yesterday’s mappings with today’s and I don’t see any duplicates…</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ grep -c '"find"' /tmp/elasticsearch-mappings*
|
||||
/tmp/elasticsearch-mappings2.txt:350
|
||||
/tmp/elasticsearch-mappings.txt:1228
|
||||
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
|
||||
1578
|
||||
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | uniq | wc -l
|
||||
1578
|
||||
</code></pre><ul>
|
||||
<li>I have no idea why they wouldn’t have been caught yesterday when I originally ran the script on a clean AReS with no mappings…
|
||||
<ul>
|
||||
<li>In any case, I combined the mappings and then uploaded them to AReS:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
|
||||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
|
||||
</code></pre><!-- raw HTML omitted -->
@ -9,7 +9,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
|
||||
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
|
||||
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />
|
||||
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="Categories"/>
|
||||
|
@ -9,7 +9,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
|
||||
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
|
||||
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />
|
||||
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="Notes"/>
|
||||
|
@ -9,7 +9,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
|
||||
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
|
||||
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />
|
||||
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="CGSpace Notes"/>
|
||||
|
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-10-24T22:23:06+03:00" />
<meta property="og:updated_time" content="2020-10-26T16:34:45+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -4,27 +4,27 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-10-24T22:23:06+03:00</lastmod>
<lastmod>2020-10-26T16:34:45+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-10-24T22:23:06+03:00</lastmod>
<lastmod>2020-10-26T16:34:45+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-10-24T22:23:06+03:00</lastmod>
<lastmod>2020-10-26T16:34:45+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-10/</loc>
<lastmod>2020-10-24T22:23:06+03:00</lastmod>
<lastmod>2020-10-26T16:34:45+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-10-24T22:23:06+03:00</lastmod>
<lastmod>2020-10-26T16:34:45+03:00</lastmod>
</url>

<url>