<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="May, 2019" />
<meta property="og:description" content="2019-05-01
Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
Apparently if the item is in the workflowitem table it is submitted to a workflow
And if it is in the workspaceitem table it is in the pre-submitted state
The item seems to be in a pre-submitted state, so I tried to delete it from there:
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present…
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-05/" />
<meta property="article:published_time" content="2019-05-01T07:37:43+03:00" />
<meta property="article:modified_time" content="2019-05-06T11:50:57+03:00" />

<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="May, 2019" />
<meta name="twitter:description" content="2019-05-01
Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
Apparently if the item is in the workflowitem table it is submitted to a workflow
And if it is in the workspaceitem table it is in the pre-submitted state
The item seems to be in a pre-submitted state, so I tried to delete it from there:
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present…
"/>
<meta name="generator" content="Hugo 0.55.5" />

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2019",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-05\/",
"wordCount": "1511",
"datePublished": "2019-05-01T07:37:43+03:00",
"dateModified": "2019-05-06T11:50:57+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-05/">

<title>May, 2019 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">

<!-- RSS 2.0 feed -->
</head>
<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-05/">May, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-05-01T07:37:43+03:00">Wed May 01, 2019</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2019-05-01">2019-05-01</h2>

<ul>
<li>Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace</li>
<li>A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items

<ul>
<li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li>
<li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state (a quick way to check both tables is sketched after this list)</li>
</ul></li>
<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p>

<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
</code></pre></li>
<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present…</p></li>
</ul>
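<p>A quick way to check which of those two tables a stuck item is in (a sketch, assuming <code>psql</code> access to the <code>dspace</code> database and using the item ID from the delete above):</p>

<pre><code>$ psql -d dspace -c 'SELECT * FROM workspaceitem WHERE item_id=74648;'
$ psql -d dspace -c 'SELECT * FROM workflowitem WHERE item_id=74648;'
</code></pre>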
<ul>
<li><p>I managed to delete the problematic item from the database</p>

<ul>
<li>First I deleted the item’s bitstream in XMLUI and then ran <code>dspace cleanup -v</code> to remove it from the assetstore</li>
<li><p>Then I ran the following SQL:</p>

<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
dspace=# DELETE FROM item WHERE item_id=74648;
</code></pre></li>
</ul></li>
<li><p>Now the item is (hopefully) really gone and I can continue to troubleshoot the issue with the REST API’s <code>/items/find-by-metadata-value</code> endpoint</p>

<ul>
<li><p>Of course I run into another HTTP 401 error when I continue trying the LandPortal search from last month:</p>

<pre><code>$ curl -f -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
curl: (22) The requested URL returned error: 401 Unauthorized
</code></pre></li>
</ul></li>
<li><p>The DSpace log shows the item ID (because I modified the error text):</p>

<pre><code>2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
</code></pre></li>
<li><p>If I delete that one I get another, making the list of item IDs so far:</p>

<ul>
<li>74648</li>
<li>77708</li>
<li>85079</li>
</ul></li>
<li><p>Some are in the <code>workspaceitem</code> table (pre-submission), others are in the <code>workflowitem</code> table (submitted), and others are actually approved, but withdrawn…</p>

<ul>
<li>This is actually a worthless exercise because the real issue is that the <code>/items/find-by-metadata-value</code> endpoint is simply flawed by design and shouldn’t fatally error when the search returns items the user doesn’t have permission to access</li>
<li>It would take way too much time to try to fix the fucked up items that are in limbo by deleting them in SQL, but also, it doesn’t actually fix the problem because some items are <em>submitted</em> but <em>withdrawn</em>, so they actually have handles and everything</li>
<li>I think the solution is to recommend people don’t use the <code>/items/find-by-metadata-value</code> endpoint</li>
</ul></li>
<li><p>CIP is asking about embedding PDF thumbnail images in their RSS feeds again</p>

<ul>
<li>They asked in 2018-09 as well and I told them it wasn’t possible</li>
<li>To make sure, I looked at <a href="https://wiki.duraspace.org/display/DSPACE/Enable+Media+RSS+Feeds">the documentation for RSS media feeds</a> and tried it, but couldn’t get it to work</li>
<li>It seems to be geared towards iTunes and Podcasts… I dunno</li>
</ul></li>
<li><p>CIP also asked for a way to get an XML file of all their RTB journal articles on CGSpace</p>

<ul>
<li><p>I told them to use the REST API like this (where <code>1179</code> is the id of the RTB journal articles collection; a full <code>curl</code> example is sketched after this list):</p>

<pre><code>https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata
</code></pre></li>
</ul></li>
</ul>
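<p>For reference, a minimal sketch of fetching that with curl and saving it to a file (assuming the REST API returns XML when asked for it via the <code>Accept</code> header; the output path is just an example):</p>

<pre><code>$ curl -s -H "Accept: application/xml" "https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata" > /tmp/rtb-journal-articles.xml
</code></pre>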
<h2 id="2019-05-03">2019-05-03</h2>

<ul>
<li><p>A user from CIAT emailed to say that CGSpace submission emails have not been working the last few weeks</p>

<ul>
<li><p>I checked the <code>dspace test-email</code> script on CGSpace and they are indeed failing:</p>

<pre><code>$ dspace test-email
About to send test email:
- To: woohoo@cgiar.org
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
- Error: javax.mail.AuthenticationFailedException
Please see the DSpace documentation for assistance.
</code></pre></li>
</ul></li>
<li><p>I will ask ILRI ICT to reset the password</p>

<ul>
<li>They reset the password and I tested it on CGSpace (the SMTP settings themselves can be double checked as sketched after this list)</li>
</ul></li>
</ul>
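<p>For the record, a quick way to see which SMTP settings DSpace is actually using (a sketch; the installation path here is an assumption):</p>

<pre><code>$ grep -E '^mail\.' /home/cgspace.cgiar.org/config/dspace.cfg
</code></pre>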
<h2 id="2019-05-05">2019-05-05</h2>

<ul>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Merge changes into the <code>5_x-prod</code> branch of CGSpace:

<ul>
<li>Updates to remove deprecated social media websites (Google+ and Delicious), update Twitter share intent, and add item title to Twitter and email links (<a href="https://github.com/ilri/DSpace/pull/421">#421</a>)</li>
<li>Add new CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/pull/420">#420</a>)</li>
<li>Add item ID to REST API error logging (<a href="https://github.com/ilri/DSpace/pull/422">#422</a>)</li>
</ul></li>
<li>Re-deploy CGSpace from <code>5_x-prod</code> branch</li>
<li>Run all system updates on CGSpace (linode18) and reboot it</li>
<li>Tag version 1.1.0 of the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> (with Falcon 2.0.0)

<ul>
<li>Deploy on DSpace Test</li>
</ul></li>
</ul>
<h2 id="2019-05-06">2019-05-06</h2>

<ul>
<li><p>Peter pointed out that Solr stats are only showing 2019 stats</p>

<ul>
<li><p>I looked at the Solr Admin UI and I see:</p>

<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre></li>
</ul></li>
<li><p>As well as this error in the logs:</p>

<pre><code>Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre></li>
<li><p>Strangely enough, I <em>do</em> see the statistics-2018, statistics-2017, etc cores in the Admin UI…</p></li>
<li><p>I restarted Tomcat a few times (and even deleted all the Solr write locks, as sketched below after the graphs) and at least five times there were issues loading one statistics core, causing the Atmire stats to be incomplete</p>

<ul>
<li>Also, I tried to increase the <code>writeLockTimeout</code> in <code>solrconfig.xml</code> from the default of 1000ms to 10000ms</li>
<li>Eventually the Atmire stats started working, despite errors about “Error opening new searcher” in the Solr Admin UI</li>
<li>I wrote to the dspace-tech mailing list again on the thread from March, 2019</li>
</ul></li>
<li><p>There were a few alerts from UptimeRobot about CGSpace going up and down this morning, along with an alert from Linode about 596% load</p>

<ul>
<li>Looking at the Munin stats I see an exponential rise in DSpace XMLUI sessions, firewall activity, and PostgreSQL connections this morning:</li>
</ul></li>
</ul>
<p><img src="/cgspace-notes/2019/05/2019-05-06-jmx_dspace_sessions-day.png" alt="CGSpace XMLUI sessions day" /></p>

<p><img src="/cgspace-notes/2019/05/2019-05-06-fw_conntrack-day.png" alt="linode18 firewall connections day" /></p>

<p><img src="/cgspace-notes/2019/05/2019-05-06-postgres_connections_db-day.png" alt="linode18 postgres connections day" /></p>

<p><img src="/cgspace-notes/2019/05/2019-05-06-cpu-day.png" alt="linode18 CPU day" /></p>
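<p>Going back to the Solr issue above, this is roughly how I find and clear the stale write locks (a sketch: only do this with Tomcat stopped, and the path comes from the lock error in the logs):</p>

<pre><code># find /home/cgspace.cgiar.org/solr -type f -name write.lock
# find /home/cgspace.cgiar.org/solr -type f -name write.lock -delete
</code></pre>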
<ul>
<li><p>The number of unique sessions today is <em>ridiculously</em> high compared to the last few days considering it’s only 12:30PM right now:</p>

<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
101108
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
14618
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-04 | sort | uniq | wc -l
14946
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-03 | sort | uniq | wc -l
6410
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc -l
7758
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
20528
</code></pre></li>
<li><p>The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:</p>

<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
7127
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1231
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '04/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1255
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '03/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1736
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '02/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1573
# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1410
</code></pre></li>
<li><p>Just this morning between the hours of 2 and 6 the number of unique sessions was <em>very</em> high compared to previous mornings:</p>

<pre><code>$ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
83650
$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2547
$ cat dspace.log.2019-05-04 | grep -E '2019-05-04 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2574
$ cat dspace.log.2019-05-03 | grep -E '2019-05-03 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2911
$ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2704
$ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3699
</code></pre></li>
<li><p>Most of the requests were GETs:</p>

<pre><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "(GET|HEAD|POST|PUT)" | sort | uniq -c | sort -n
1 PUT
98 POST
2845 HEAD
98121 GET
</code></pre></li>
<li><p>I’m not exactly sure what happened this morning, but it looks like some legitimate user traffic; perhaps someone launched a new publication and it got a bunch of hits?</p></li>
<li><p>Looking again, I see 84,000 requests to <code>/handle</code> this morning (not including logs for library.cgiar.org because those get an HTTP 301 redirect to CGSpace and appear here in <code>access.log</code>):</p>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E "/handle/[0-9]+/[0-9]+"
84350
</code></pre></li>
<li><p>But it would be difficult to find a pattern for those requests because they cover 78,000 <em>unique</em> Handles (i.e. direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):</p>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "/handle/[0-9]+/[0-9]+ HTTP" | sort | uniq | wc -l
78104
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "/handle/[0-9]+/[0-9]+/(discover|browse)" | wc -l
2492
</code></pre></li>
<li><p>In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:</p>

<pre><code># grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
3 2a01:7e00::f03c:91ff:fe0a:d645
113 63.32.242.35
</code></pre></li>
<li><p>According to <a href="https://viewdns.info/reverseip/?host=63.32.242.35&t=1">viewdns.info</a> that server belongs to Macaroni Brothers’</p>

<ul>
<li>The user agent of their non-REST API requests from the same IP is Drupal (checked with the pipeline sketched after this list)</li>
<li>This is one very good reason to limit REST API requests, and perhaps to enable caching via nginx</li>
</ul></li>
</ul>
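<p>A minimal sketch of how to check the user agents behind that IP’s non-REST requests (assuming nginx’s default combined log format, where the user agent is the sixth quote-delimited field):</p>

<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '63.32.242.35' | grep -v '/rest/' | awk -F'"' '{print $6}' | sort | uniq -c | sort -n
</code></pre>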
<!-- vim: set sw=2 ts=2: -->
</article>

</div><!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-05/">May, 2019</a></li>
<li><a href="/cgspace-notes/posts/">Posts</a></li>
<li><a href="/cgspace-notes/2019-04/">April, 2019</a></li>
<li><a href="/cgspace-notes/2019-03/">March, 2019</a></li>
<li><a href="/cgspace-notes/2019-02/">February, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>

</div><!-- /.row -->
</div><!-- /.container -->

<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>
</html>