Compare commits


2 Commits

SHA1  Message  Date
2903ae05a0  Add notes for 2022-02-07  2022-02-07 11:39:54 +03:00
69c7f3b684  Fix notes for 2022-01  2022-02-07 09:49:34 +03:00
3 changed files with 59 additions and 7 deletions


@ -109,7 +109,7 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=aort
## 2022-01-22
- Spend some time adding months to the CGIAR TAC and ICW records from Gaia
- Most of the PDFs have months so this is annoying...
- Most of the PDFs have only YYYY, so this is annoying...
## 2022-01-23


@ -156,4 +156,56 @@ java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([B
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
```
## 2022-02-04
- I found a thread on the dspace-tech mailing list about the `filter-media` crash above
- The problem is that the default filter for Word files is outdated, so we need to switch to the PoiWordFilter extractor
- After changing that I was able to filter the Word file on that item above:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -i 10568/67391 -p "Word Text Extractor" -v
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
org.dspace.app.mediafilter.PoiWordFilter
File: Agreement_on_the_Estab_of_ILRI.doc.txt
FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
```
- Meeting with the repositories working group to discuss issues moving forward in the One CGIAR
## 2022-02-07
- Gaia sent me her feedback on the duplicates for the TAC and ICW items for CGSpace a few days ago
- I used the IDs marked "delete" in her spreadsheet to create a custom text facet with this GREL in OpenRefine:
```console
or(
isNotNull(value.match('1')),
isNotNull(value.match('4')),
isNotNull(value.match('5')),
isNotNull(value.match('6')),
isNotNull(value.match('8')),
...
isNotNull(value.match('178')),
isNotNull(value.match('186')),
isNotNull(value.match('188')),
isNotNull(value.match('189')),
isNotNull(value.match('197'))
)
```
- Then I flagged all of these (seventy-five items)...
- I decided to flag the deletes instead of starring the keeps because there are some items in the original file that were not marked as duplicates, so we have to keep those too
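- A long `or(...)` facet like the one above is tedious to type by hand; a minimal sketch of generating it, assuming the "delete" IDs have been copied out of the spreadsheet into a Python list (the `build_grel_facet` helper and the sample IDs are hypothetical and were not part of the original workflow):
```python
def build_grel_facet(ids):
    """Build a GREL or(...) expression with one isNotNull(value.match('ID')) clause per ID."""
    clauses = [f"isNotNull(value.match('{i}'))" for i in ids]
    return "or(\n" + ",\n".join(clauses) + "\n)"


# Hypothetical subset of the IDs marked "delete" in the spreadsheet
delete_ids = [1, 4, 5, 6, 8]

# Paste the printed expression into OpenRefine's custom text facet dialog
print(build_grel_facet(delete_ids))
```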
- I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference:
```console
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/tac.csv
$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-02-07-tac-batch2-201-400.csv
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/batch2-filenames.csv
$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
```
- Then I sent this second batch of items to Gaia to look at
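- For reference, a rough sketch of what the `csvjoin -c id` step does, written with Python's standard `csv` module; it assumes an inner join on `id` with unique IDs in the filenames file, and the `join_on_id` helper and the sample rows are hypothetical stand-ins for the real CSVs:
```python
import csv
import io


def join_on_id(left_rows, right_rows, key="id"):
    """Inner-join two lists of dict rows on a shared key column."""
    # Index the right-hand rows by key (assumes each key appears only once)
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged = dict(row)
            # Add the right-hand columns, skipping the duplicate key column
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined


# Hypothetical sample data, not taken from the real batch files
results = list(csv.DictReader(io.StringIO("id,dc.title\n201,Report A\n202,Report B\n")))
filenames = list(csv.DictReader(io.StringIO("id,filename\n201,a.pdf\n202,b.pdf\n")))

for row in join_on_id(results, filenames):
    print(row)
```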
<!-- vim: set sw=2 ts=2: -->


@ -3,22 +3,22 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-02/</loc>
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
<lastmod>2022-02-04T08:15:52+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-01/</loc>
<lastmod>2022-01-31T09:00:59+03:00</lastmod>
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-12/</loc>
<lastmod>2022-01-09T10:39:51+02:00</lastmod>