mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-19 13:17:06 +01:00
Add notes for 2022-02-07
parent
69c7f3b684
commit
2903ae05a0
@ -156,4 +156,56 @@ java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([B
|
|||||||
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
|
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## 2022-02-04
|
||||||
|
|
||||||
|
- I found a thread on the dspace-tech mailing list about the `filter-media` crash above
|
||||||
|
- The problem is that the default filter for Word files is outdated, so we need to switch to the PoiWordFilter extractor
|
||||||
|
- After changing that I was able to filter the Word file on that item above:
|
||||||
|
|
||||||
|
```console
|
||||||
|
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -i 10568/67391 -p "Word Text Extractor" -v
|
||||||
|
The following MediaFilters are enabled:
|
||||||
|
Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
|
||||||
|
org.dspace.app.mediafilter.PoiWordFilter
|
||||||
|
File: Agreement_on_the_Estab_of_ILRI.doc.txt
|
||||||
|
|
||||||
|
FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
|
||||||
|
```
|
||||||
|
|
||||||
|
- Meeting with the repositories working group to discuss issues moving forward in the One CGIAR
|
||||||
|
|
||||||
|
## 2022-02-07
|
||||||
|
|
||||||
|
- Gaia sent me her feedback on the duplicates for the TAC and ICW items for CGSpace a few days ago
|
||||||
|
- I used the IDs marked "delete" in her spreadsheet to create a custom text facet with this GREL in OpenRefine:
|
||||||
|
|
||||||
|
```console
|
||||||
|
or(
|
||||||
|
isNotNull(value.match('1')),
|
||||||
|
isNotNull(value.match('4')),
|
||||||
|
isNotNull(value.match('5')),
|
||||||
|
isNotNull(value.match('6')),
|
||||||
|
isNotNull(value.match('8')),
|
||||||
|
...
|
||||||
|
isNotNull(value.match('178')),
|
||||||
|
isNotNull(value.match('186')),
|
||||||
|
isNotNull(value.match('188')),
|
||||||
|
isNotNull(value.match('189')),
|
||||||
|
isNotNull(value.match('197'))
|
||||||
|
)
|
||||||
|
```
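As a rough illustration of what that facet computes, here is a minimal Python sketch. It assumes GREL's `match` only succeeds when the pattern covers the entire cell value, so each clause is effectively an exact ID comparison. `DELETE_IDS` and `is_flagged` are hypothetical names, and the list holds only the IDs visible in the expression above (the rest are elided in the notes).

```python
import re

# Hypothetical: only the IDs shown in the notes; the full list is elided there.
DELETE_IDS = ["1", "4", "5", "6", "8", "178", "186", "188", "189", "197"]


def is_flagged(value: str) -> bool:
    """Return True if the cell value is exactly one of the delete IDs.

    re.fullmatch mirrors the assumed whole-string behaviour of GREL's match(),
    so "18" does not match the pattern "1" or "8".
    """
    return any(re.fullmatch(re.escape(item), value) for item in DELETE_IDS)


def flag_rows(rows):
    """Keep only the rows whose 'id' column is marked for deletion."""
    return [row for row in rows if is_flagged(row["id"])]
```

Because every clause is an exact comparison, a set lookup (`value in set(DELETE_IDS)`) would behave the same way; the regex form is kept here only to stay close to the GREL expression.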
|
||||||
|
|
||||||
|
- Then I flagged all of these (seventy-five items)...
|
||||||
|
- I decided to flag the deletes instead of starring the keeps because there are some items in the original file that were not marked as duplicates, so we have to keep those too
|
||||||
|
- I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference:
|
||||||
|
|
||||||
|
```console
|
||||||
|
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/tac.csv
|
||||||
|
$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-02-07-tac-batch2-201-400.csv
|
||||||
|
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/batch2-filenames.csv
|
||||||
|
$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
- Then I sent this second batch of items to Gaia to look at
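The `csvcut` and `csvjoin` steps above amount to selecting a few columns and then doing an inner join on `id`. A minimal Python sketch of that join, using only the standard library, might look like the following. `read_rows` and `join_on_id` are hypothetical helpers, not part of csvkit, and the sketch assumes each `id` appears at most once in the filenames file.

```python
import csv
import io


def read_rows(csv_text: str):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def join_on_id(left_rows, right_rows, key="id"):
    """Inner-join two lists of row dicts on a shared key column.

    Assumption: the key is unique in right_rows. Rows in left_rows with
    no counterpart in right_rows are dropped.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is None:
            continue  # inner join: skip rows with no counterpart
        merged = dict(row)
        # Add the extra columns from the right side, skipping the key itself
        merged.update({k: v for k, v in match.items() if k != key})
        joined.append(merged)
    return joined
```

For example, calling `join_on_id(read_rows(duplicates_csv), read_rows(filenames_csv))` with the text of the two intermediate files would give each duplicate-check row an extra `filename` column.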
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
@ -3,22 +3,22 @@
|
|||||||
xmlns:xhtml="http://www.w3.org/1999/xhtml">
|
xmlns:xhtml="http://www.w3.org/1999/xhtml">
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
|
||||||
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
|
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
|
||||||
</url><url>
|
</url><url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||||
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
|
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
|
||||||
</url><url>
|
</url><url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2022-02/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2022-02/</loc>
|
||||||
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
|
<lastmod>2022-02-04T08:15:52+03:00</lastmod>
|
||||||
</url><url>
|
</url><url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
|
||||||
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
|
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
|
||||||
</url><url>
|
</url><url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
||||||
<lastmod>2022-02-02T23:51:22+03:00</lastmod>
|
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
|
||||||
</url><url>
|
</url><url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2022-01/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2022-01/</loc>
|
||||||
<lastmod>2022-01-31T09:00:59+03:00</lastmod>
|
<lastmod>2022-02-07T09:49:34+03:00</lastmod>
|
||||||
</url><url>
|
</url><url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2021-12/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2021-12/</loc>
|
||||||
<lastmod>2022-01-09T10:39:51+02:00</lastmod>
|
<lastmod>2022-01-09T10:39:51+02:00</lastmod>
|
||||||
|