<!DOCTYPE html>
|
|
<html lang="en-us">
|
|
<head prefix="og: http://ogp.me/ns#">
|
|
<meta charset="utf-8" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
|
|
<meta property="og:title" content=" September, 2016 · CGSpace Notes" />
|
|
|
|
<meta property="og:site_name" content="CGSpace Notes" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-09/" />
|
|
|
|
|
|
<meta property="og:type" content="article" />
|
|
|
|
<meta property="og:article:published_time" content="2016-09-01T15:53:00+03:00" />
|
|
|
|
<meta property="og:article:tag" content="notes" />
|
|
|
|
|
|
|
|
<title>
|
|
September, 2016 · CGSpace Notes
|
|
</title>
|
|
|
|
<link rel="stylesheet" href="https://alanorth.github.io/cgspace-notes/css/bootstrap.min.css" />
|
|
<link rel="stylesheet" href="https://alanorth.github.io/cgspace-notes/css/main.css" />
|
|
<link rel="stylesheet" href="https://alanorth.github.io/cgspace-notes/css/font-awesome.min.css" />
|
|
<link rel="stylesheet" href="https://alanorth.github.io/cgspace-notes/css/github.css" />
|
|
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
|
|
<link rel="shortcut icon" href="https://alanorth.github.io/cgspace-notes/images/favicon.ico" />
|
|
<link rel="apple-touch-icon" href="https://alanorth.github.io/cgspace-notes/images/apple-touch-icon.png" />
|
|
|
|
</head>
|
|
<body>
|
|
<header class="global-header" style="background-image:url(../images/bg.jpg )">
|
|
<section class="header-text">
|
|
<h1><a href="https://alanorth.github.io/cgspace-notes/">CGSpace Notes</a></h1>
|
|
|
|
<div class="sns-links hidden-print">
</div>
|
|
|
|
|
|
<a href="https://alanorth.github.io/cgspace-notes/" class="btn-header btn-back hidden-xs">
|
|
<i class="fa fa-angle-left" aria-hidden="true"></i>
|
|
Home
|
|
</a>
|
|
|
|
|
|
</section>
|
|
</header>
|
|
<main class="container">
|
|
|
|
|
|
<article>
|
|
<header>
|
|
<h1 class="text-primary">September, 2016</h1>
|
|
<div class="post-meta clearfix">
|
|
<div class="post-date pull-left">
|
|
Posted on
|
|
<time datetime="2016-09-01T15:53:00+03:00">
|
|
Sep 1, 2016
|
|
</time>
|
|
</div>
|
|
<div class="pull-right">
|
|
|
|
<span class="post-tag small"><a href="https://alanorth.github.io/cgspace-notes//tags/notes">#notes</a></span>
|
|
|
|
</div>
|
|
</div>
|
|
</header>
|
|
<section>
|
|
|
|
|
|
<h2 id="2016-09-01">2016-09-01</h2>
|
|
|
|
<ul>
|
|
<li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li>
|
|
<li>Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace</li>
|
|
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
|
|
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
</code></pre>
|
|
|
|
<ul>
|
|
<li>User who has been migrated to the root vs user still in the hierarchical structure:</li>
|
|
</ul>
|
|
|
|
<pre><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG
</code></pre>
|
|
|
|
<ul>
|
|
<li>Changing the DSpace LDAP config to use <code>OU=ILRIHUB</code> seems to work:</li>
|
|
</ul>
|
|
|
|
<p><img src="../images/2016/09/ilri-ldap-users.png" alt="DSpace groups based on LDAP DN" /></p>
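<p>To illustrate the OU-versus-DC distinction, here is a minimal Python sketch that checks a distinguished name for either marker. The helper names <code>split_dn</code> and <code>is_ilri</code> are hypothetical, and this is not DSpace's own matching logic; it only shows that a single check can cover both the migrated and the hierarchical DN shown above.</p>

```python
def split_dn(dn):
    """Split a distinguished name into its components, honoring backslash escapes."""
    parts, current, escaped = [], [], False
    for ch in dn:
        if escaped:
            # Character after a backslash is taken literally (e.g. "\,")
            current.append(ch)
            escaped = False
        elif ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == ",":
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def is_ilri(dn):
    """True if the DN has OU=ILRIHUB (flat structure) or DC=ILRI (old hierarchy)."""
    components = {part.upper() for part in split_dn(dn)}
    return "OU=ILRIHUB" in components or "DC=ILRI" in components


# The two example DNs from the notes above
migrated = r"CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG"
legacy = r"CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG"

print(is_ilri(migrated), is_ilri(legacy))
```

<p>The escaped comma in <code>CN=Last\, First (ILRI)</code> is the reason for the small hand-written splitter: a plain <code>split(",")</code> would cut the common name in two.</p>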
|
|
|
|
<ul>
|
|
<li>Notes for local PostgreSQL database recreation from production snapshot:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest createuser;'
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
$ vacuumdb dspacetest
</code></pre>
|
|
|
|
<ul>
|
|
<li>Some names that I thought I had fixed in July do not seem to be fixed after all:</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
      text_value       |              authority               | confidence
-----------------------+--------------------------------------+------------
 Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb |        600
 Poole, Elizabeth Jane | 41628f42-fc38-4b38-b473-93aec9196326 |        600
 Poole, Elizabeth Jane | 83b82da0-f652-4ebc-babc-591af1697919 |        600
 Poole, Elizabeth Jane | c3a22456-8d6a-41f9-bba0-de51ef564d45 |        600
 Poole, E.J.           | c3a22456-8d6a-41f9-bba0-de51ef564d45 |        600
 Poole, E.J.           | 0fbd91b9-1b71-4504-8828-e26885bf8b84 |        600
(6 rows)
</code></pre>
|
|
|
|
<ul>
|
|
<li>At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
UPDATE 69
</code></pre>
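<p>The same "inspect, then unify" pattern can be tried out safely before touching a real database. The sketch below uses Python's built-in <code>sqlite3</code> with an in-memory table as a stand-in for PostgreSQL (an assumption made purely for illustration); the table has only the columns used in the queries above, and the three rows are taken from the query output shown earlier. The helper <code>authorities</code> is a hypothetical name.</p>

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL metadatavalue table (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table metadatavalue ("
    "text_value text, authority text, confidence integer, "
    "metadata_field_id integer, resource_type_id integer)"
)
rows = [
    ("Poole, Elizabeth Jane", "b6efa27f-8829-4b92-80fe-bc63e03e3ccb"),
    ("Poole, Elizabeth Jane", "c3a22456-8d6a-41f9-bba0-de51ef564d45"),
    ("Poole, E.J.", "0fbd91b9-1b71-4504-8828-e26885bf8b84"),
]
conn.executemany("insert into metadatavalue values (?, ?, 600, 3, 2)", rows)


def authorities(pattern):
    """Return the distinct authority IDs for author names matching a LIKE pattern."""
    cur = conn.execute(
        "select distinct authority from metadatavalue "
        "where metadata_field_id=3 and resource_type_id=2 and text_value like ?",
        (pattern,),
    )
    return sorted(row[0] for row in cur)


before = authorities("Poole, %")

# Unify every matching row onto one authority, as in the update above
cur = conn.execute(
    "update metadatavalue set authority=?, confidence=600 "
    "where metadata_field_id=3 and resource_type_id=2 and text_value like ?",
    ("c3a22456-8d6a-41f9-bba0-de51ef564d45", "Poole, %"),
)
updated = cur.rowcount
after = authorities("Poole, %")

print(len(before), updated, after)
```

<p>Passing the pattern and the authority as query parameters rather than pasting them into the SQL string makes the same helper reusable for the other authors.</p>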
|
|
|
|
<ul>
|
|
<li>And for Peter Ballantyne:</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
    text_value     |              authority               | confidence
-------------------+--------------------------------------+------------
 Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 |        600
 Ballantyne, Peter | 4f04ca06-9a76-4206-bd9c-917ca75d278e |        600
 Ballantyne, P.G.  | 4f04ca06-9a76-4206-bd9c-917ca75d278e |        600
 Ballantyne, Peter | ba5f205b-b78b-43e5-8e80-0c9a1e1ad2ca |        600
 Ballantyne, Peter | 20f21160-414c-4ecf-89ca-5f2cb64e75c1 |        600
(5 rows)
</code></pre>
|
|
|
|
<ul>
|
|
<li>Again, a few have the correct ORCID, but there should only be one authority…</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
UPDATE 58
</code></pre>
|
|
|
|
<ul>
|
|
<li>And for me:</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
 text_value |              authority               | confidence
------------+--------------------------------------+------------
 Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
 Orth, A.   | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
 Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
(3 rows)
dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
UPDATE 11
</code></pre>
|
|
|
|
<ul>
|
|
<li>And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
UPDATE 166
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
       text_value       |              authority               | confidence
------------------------+--------------------------------------+------------
 Campbell, Bruce        | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
 Campbell, Bruce Morgan | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
 Campbell, B.           | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
 Campbell, B.M.         | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
(4 rows)
</code></pre>
|
|
|
|
<ul>
|
|
<li>After updating the Authority indexes (<code>bin/dspace index-authority</code>) everything looks good</li>
|
|
<li>Run authority updates on CGSpace</li>
|
|
</ul>
|
|
|
|
<h2 id="2016-09-05">2016-09-05</h2>
|
|
|
|
<ul>
|
|
<li>After one week of logging TLS connections on CGSpace:</li>
|
|
</ul>
|
|
|
|
<pre><code># zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
</code></pre>
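<p>As a quick check of the proportion, the two counts above can simply be divided. The variable names below are illustrative only.</p>

```python
des_cbc3_hits = 217        # log lines mentioning DES-CBC3
total_requests = 1164376   # all lines in the same access logs

# Share of connections that negotiated a 3DES cipher, as a percentage
share = des_cbc3_hits / total_requests * 100
print(f"{share:.2f}%")
```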
|
|
|
|
<ul>
|
|
<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
|
|
<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
|
|
</ul>
|
|
|
|
<pre><code>value + "__description:" + cells["dc.type"].value
</code></pre>
|
|
|
|
<ul>
|
|
<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&D.pdf__description:Brief</code></li>
|
|
</ul>
|
|
|
|
<h2 id="2016-09-06">2016-09-06</h2>
|
|
|
|
<ul>
|
|
<li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li>
|
|
<li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
|
|
|
|
<ul>
|
|
<li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li>
|
|
<li>Imports fine on DSpace running on Mac OS X</li>
|
|
<li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li>
|
|
</ul></li>
|
|
<li>Change diacritic in file name from á to a and re-create SAF bundle and zip
|
|
|
|
<ul>
|
|
<li>Success on both Mac OS X and Linux…</li>
|
|
</ul></li>
|
|
<li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li>
|
|
<li>See: <a href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li>
|
|
<li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&s=&uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&s=&uv=0</a></li>
|
|
<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
|
|
<li>We should definitely clean filenames so they don’t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>"</code></li>
|
|
</ul>
|
|
|
|
<pre><code>value.replace("'","").replace(",","").replace('"','')
</code></pre>
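<p>A possible Python counterpart to the expression above, for cleaning names on disk, could look like the sketch below. The function names <code>clean_name</code> and <code>rename_in_dir</code> are hypothetical. In addition to dropping the three tricky characters, it normalizes names to NFC so that a decomposed <code>a</code> + U+0301 becomes the single character á; that step is my addition, based on the decomposed names seen from Mac OS X, and the sketch assumes it runs on Linux, where both forms are distinct file names.</p>

```python
import unicodedata
from pathlib import Path

# Translation table that deletes comma, single quote, and double quote
TRICKY = str.maketrans("", "", ",'\"")


def clean_name(name):
    """Normalize a file name to NFC and remove characters that are awkward in CSV and shell."""
    return unicodedata.normalize("NFC", name).translate(TRICKY)


def rename_in_dir(directory):
    """Rename files in `directory` whose cleaned name differs; return (old, new) pairs."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        new_name = clean_name(path.name)
        if path.is_file() and new_name != path.name:
            target = path.with_name(new_name)
            if target.exists():
                # Refuse to overwrite a different file that already has the clean name
                raise FileExistsError(target)
            path.rename(target)
            renamed.append((path.name, new_name))
    return renamed


# Decomposed form (a + combining acute accent), as produced on Mac OS X
decomposed = "Turipana\u0301, Colombia.pdf"
print(clean_name(decomposed))
```

<p>The same <code>clean_name</code> function would also need to be applied to the file names in the CSV, otherwise the metadata and the files on disk would no longer match.</p>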
|
|
|
|
<ul>
|
|
<li>I need to write a Python script to match that for renaming files in the file system</li>
|
|
<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
|
|
<li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li>
|
|
<li>In the end I was able to import the files after unzipping them ONLY on Linux
|
|
|
|
<ul>
|
|
<li>The CSV file was giving file names in UTF-8, while unzipping the zip on Mac OS X and transferring it was converting the file names to the decomposed (canonically equivalent) Unicode form that I saw above</li>
|
|
</ul></li>
|
|
<li>Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the <code>tomcat7</code> user, and deleting the bundle, for each collection’s items:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
</code></pre>
|
|
|
|
<h2 id="2016-09-07">2016-09-07</h2>
|
|
|
|
<ul>
|
|
<li>Erase and rebuild DSpace Test based on latest Ubuntu 16.04, PostgreSQL 9.5, and Java 8 stuff</li>
|
|
<li>Reading about PostgreSQL maintenance and it seems manual vacuuming is only for certain workloads, such as heavy update/write loads</li>
|
|
<li>I suggest we disable our nightly manual vacuum task, as we’re a mostly read workload, and I’d rather stick as close to the documentation as possible since we haven’t done any testing/observation of PostgreSQL</li>
|
|
<li>See: <a href="https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html">https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html</a></li>
|
|
<li>CGSpace went down and the error seems to be the same as always (lately):</li>
|
|
</ul>
|
|
|
|
<pre><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre>
|
|
|
|
<ul>
|
|
<li>Since CGSpace had crashed I quickly deployed the new LDAP settings before restarting Tomcat</li>
|
|
</ul>
|
|
|
|
<h2 id="2016-09-13">2016-09-13</h2>
|
|
|
|
<ul>
|
|
<li>CGSpace crashed twice today, errors from <code>catalina.out</code>:</li>
|
|
</ul>
|
|
|
|
<pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre>
|
|
|
|
<ul>
|
|
<li>I enabled logging of requests to <code>/rest</code> again</li>
|
|
</ul>
|
|
|
|
<h2 id="2016-09-14">2016-09-14</h2>
|
|
|
|
<ul>
|
|
<li>CGSpace crashed again, errors from <code>catalina.out</code>:</li>
|
|
</ul>
|
|
|
|
<pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre>
|
|
|
|
<ul>
|
|
<li>I restarted Tomcat and it was ok again</li>
|
|
<li>CGSpace crashed a few hours later, errors from <code>catalina.out</code>:</li>
|
|
</ul>
|
|
|
|
<pre><code>Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre>
|
|
|
|
<ul>
|
|
<li>We haven’t seen that in quite a while…</li>
|
|
<li>Indeed, in a month of logs it only occurs 15 times:</li>
|
|
</ul>
|
|
|
|
<pre><code># grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
15
</code></pre>
|
|
|
|
<ul>
|
|
<li>I also see a bunch of errors from dspace.log:</li>
|
|
</ul>
|
|
|
|
<pre><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre>
|
|
|
|
<ul>
|
|
<li>Looking at REST requests, it seems there is one IP hitting us nonstop:</li>
|
|
</ul>
|
|
|
|
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
    820 50.87.54.15
  12872 70.32.99.142
  25744 70.32.83.92
# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
   7966 181.118.144.29
  54706 70.32.99.142
 109412 70.32.83.92
</code></pre>
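<p>The same tally can be sketched in Python with <code>collections.Counter</code>, which may be handier when the counts need further processing. The function name <code>top_clients</code> is hypothetical, and the only assumption about the log format is the one the <code>awk</code> command already makes: the client IP is the first whitespace-separated field. The sample lines are placeholders, not real log entries.</p>

```python
from collections import Counter


def top_clients(lines, n=3):
    """Count the first whitespace-separated field of each line; return the n most common."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return counts.most_common(n)


# Placeholder lines standing in for /var/log/nginx/rest.log
sample = [
    "70.32.83.92 - - GET /rest/items",
    "70.32.99.142 - - GET /rest/items",
    "70.32.83.92 - - GET /rest/collections",
    "50.87.54.15 - - GET /rest/items",
    "70.32.83.92 - - GET /rest/items",
    "70.32.99.142 - - GET /rest/communities",
]

print(top_clients(sample))

# With a real log file, the open file object can be passed directly:
# with open("/var/log/nginx/rest.log") as log:
#     print(top_clients(log))
```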
|
|
|
|
<ul>
|
|
<li>Those are the same IPs that were hitting us heavily in July, 2016 as well…</li>
|
|
<li>I think the stability issues are definitely from REST</li>
|
|
<li>Crashed AGAIN, errors from dspace.log:</li>
|
|
</ul>
|
|
|
|
<pre><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre>
|
|
|
|
<ul>
|
|
<li>And more heap space errors:</li>
|
|
</ul>
|
|
|
|
<pre><code># grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
19
</code></pre>
|
|
|
|
<ul>
|
|
<li>There have been no more REST requests since the last crash, so maybe something else is causing this</li>
|
|
<li>Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)</li>
|
|
<li>They seem to be coming from Baidu, and so far during today alone account for <sup>1</sup>⁄<sub>6</sub> of every connection:</li>
|
|
</ul>
|
|
|
|
<pre><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
29084
# grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
5192
</code></pre>
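<p>The "one in six" estimate follows from the two counts above; a small worked check (variable names are illustrative) puts the share at a little under 18%, or roughly one in every 5.6 logged connections.</p>

```python
all_hits = 29084    # log lines containing ip_addr=
baidu_hits = 5192   # log lines containing ip_addr=180.76.15

share = baidu_hits / all_hits
print(f"{share * 100:.1f}% of connections, about 1 in {1 / share:.1f}")
```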
|
|
|
|
<ul>
|
|
<li>Other recent days are the same… hmmm.</li>
|
|
<li>From the activity control panel I can see 58 unique IPs hitting the site <em>concurrently</em>, which has GOT to hurt our stability</li>
|
|
<li>A list of all 2000 unique IPs from CGSpace logs today:</li>
|
|
</ul>
|
|
|
|
<pre><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
</code></pre>
|
|
|
|
<ul>
|
|
<li>Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, Turnitin (iParadigms), etc… do we have any real users?</li>
|
|
<li>Generate a list of all Affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:</li>
|
|
</ul>
|
|
|
|
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
</code></pre>
|
|
|
|
</section>
|
|
<footer>
|
|
|
|
<section class="author-info row">
|
|
<div class="author-avatar col-md-2">
|
|
|
|
</div>
|
|
<div class="author-meta col-md-6">
|
|
|
|
<h1 class="author-name text-primary">Alan Orth</h1>
|
|
|
|
|
|
</div>
|
|
|
|
</section>
|
|
<ul class="pager">
|
|
|
|
<li class="previous"><a href="https://alanorth.github.io/cgspace-notes/2016-08/"><span aria-hidden="true">←</span> Older</a></li>
|
|
|
|
|
|
<li class="next disabled"><a href="#">Newer <span aria-hidden="true">→</span></a></li>
|
|
|
|
</ul>
|
|
</footer>
|
|
</article>
|
|
|
|
</main>
|
|
<footer class="container global-footer">
|
|
<div class="copyright-note pull-left">
|
|
|
|
</div>
|
|
<div class="sns-links hidden-print">
</div>
|
|
|
|
</footer>
|
|
|
|
<script src="https://alanorth.github.io/cgspace-notes/js/highlight.pack.js"></script>
|
|
<script>
|
|
hljs.initHighlightingOnLoad();
|
|
</script>
|
|
|
|
|
|
</body>
|
|
</html>
|
|
|