- Peter Ballantyne said he was having problems logging into CGSpace with "both" of his accounts (CGIAR LDAP and personal, apparently)
- I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a "no DN found" error:
```
2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException: svcgroot2.cgiarad.org:3269 [Root exception is java.net.ConnectException: Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
```
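- A quick way to reproduce the lookup by hand is with `ldapsearch` against the global catalog host from the log (the bind DN and search base below are placeholders, not our real values):

```
# Test the LDAP attribute lookup manually against the global catalog port from the log above
# (hypothetical bind DN and search base; use the real values from authentication-ldap.cfg)
$ ldapsearch -x -H 'ldaps://svcgroot2.cgiarad.org:3269' \
    -D 'cn=lookup,dc=cgiarad,dc=org' -W \
    -b 'dc=cgiarad,dc=org' '(sAMAccountName=pballantyne)' dn
```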
- Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks
- We'll need to check for browse links and handle them properly, including swapping the `subject` parameter for `systemsubject` (which doesn't exist in Discovery yet, but we'll need to add it) as we have moved their poorly curated subjects from `dc.subject` to `cg.subject.system`
- The second link was a direct link to a bitstream, which broke because the bitstream's sequence ID was updated, so I told him to link to the item's handle instead
- Working on the nginx redirects for CGIAR Library
- We should start using 301 redirects and also allow for `/sitemap` to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way
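- One easy way to sanity check this once it's deployed is to hit the old URLs with `curl` and look at the status code and redirect target (the URLs below are just examples):

```
# Expect a 301 for redirected pages and a normal response for /sitemap
$ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' 'https://library.cgiar.org/'
$ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' 'https://library.cgiar.org/sitemap'
```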
- Remove eleven occurrences of `ACP` in IITA's `cg.coverage.region` using the Atmire batch edit module from Discovery
- Need to investigate how we can verify the library.cgiar.org domain using the HTML or DNS methods
- Deploy logic to allow verification of the library.cgiar.org domain in the Google Search Console ([#343](https://github.com/ilri/DSpace/pull/343))
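- For the DNS method Google just looks for a `google-site-verification` TXT record on the domain, and for the HTML method it fetches a token file from the web root, so both are easy to check from the command line (the file name below is made up):

```
# Check for a Google site verification TXT record on the domain
$ dig +short TXT library.cgiar.org
# Or check that the HTML verification file is being served (hypothetical token file name)
$ curl -I 'https://library.cgiar.org/google1234567890abcdef.html'
```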
- After verifying both the HTTP and HTTPS domains and submitting a sitemap it will be interesting to see how the stats in the Search Console, as well as the search results, change (currently 28,500 results)
- I tried to submit a "Change of Address" request in the Google Search Console but I need to be an owner on CGSpace's console (currently I'm just a user) in order to do that
- Peter added me as an owner on the CGSpace property on Google Search Console and I tried to submit a "Change of Address" request for the CGIAR Library but got an error:
![Change of Address error](/cgspace-notes/2017/10/search-console-change-address-error.png)
- We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won't work—we'll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects
- Also the Google Search Console doesn't work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the "Change of Address" tool to work!
- Finally finish (I think) working on the myriad nginx redirects for all the CGIAR Library browse stuff—it ended up getting pretty complicated!
- I still need to commit the DSpace changes (add browse index, XMLUI strings, Discovery index, etc), but I should be able to deploy that on CGSpace soon
- Run system updates on DSpace Test and reboot server
- Merge changes adding a search/browse index for CGIAR System subject to `5_x-prod` ([#344](https://github.com/ilri/DSpace/pull/344))
- I checked the top browse links in Google's search results for `site:library.cgiar.org inurl:browse` and they are all redirected appropriately by the nginx rewrites I worked on last week
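- For example, spot checking one of the old browse URLs shows whether nginx answers with a 301 and where it points (this particular URL is illustrative):

```
# An old CGIAR Library browse URL should 301 to the equivalent browse on CGSpace,
# with the subject parameter swapped for systemsubject
$ curl -sI 'https://library.cgiar.org/browse?type=subject' | grep -E '^(HTTP|Location)'
```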
- I still have no idea what was causing the load to go up today
- I finally investigated Magdalena's issue with the item download stats and now I can't reproduce it: I get the same number of downloads reported in the stats widget on the item page, the "Most Popular Items" page, and in Usage Stats
- I think it might have been an issue with the statistics not being fresh
- CORE seems to be some bot that is "Aggregating the world’s open access research papers"
- The contact address listed in their bot's user agent is incorrect; the correct page is simply https://core.ac.uk/contact
- I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Manager Valve
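- Something like this should show how often they hit us per day, assuming their requests are identifiable by a "CORE" string in the user agent and the logs are in nginx's default location:

```
# Count requests per day from user agents containing "CORE" (the user agent string is an assumption)
$ zcat -f /var/log/nginx/access.log* | grep CORE | awk '{print $4}' | cut -d: -f1 | sort | uniq -c
```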
- After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
- For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace
- At least we know the top two IPs are CORE, but who are the others?
- 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
- Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their session variable, creating thousands of new sessions!
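- A rough way to see this is to count the unique Tomcat session IDs associated with each IP in the DSpace log (adjust the log path and date for the install in question):

```
# Count unique Tomcat sessions created by each of the two scraper IPs today
$ grep ip_addr=104.196.152.243 dspace.log.2017-10-04 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ grep ip_addr=190.19.92.5 dspace.log.2017-10-04 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
```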
- From looking at the requests, it appears these are from CIAT and CCAFS
- I wonder if I could somehow instruct them to set a user agent so that we could apply the Crawler Session Manager Valve to them
- Actually, according to the Tomcat docs, we could match client IPs using the `crawlerIps` attribute: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
- Ah, wait, it looks like `crawlerIps` was only added in 2017-06, so it probably isn't in Ubuntu 16.04's 7.0.68 build!
- That would explain the errors I was getting when trying to set it:
```
WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
```
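- To confirm, the installed Tomcat version is easy to check (assuming we are on Ubuntu's `tomcat7` package rather than a manual install):

```
# Show the packaged Tomcat version on Ubuntu 16.04
$ dpkg -s tomcat7 | grep '^Version'
```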
- I've emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
- Also, I asked if they could perhaps use the `sitemap.xml`, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets
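- All three of those entry points are already exposed by DSpace, so they wouldn't need anything special from us; for example (assuming the standard DSpace 5 webapp paths on CGSpace):

```
# Friendlier harvesting entry points than crawling Discovery facets
$ curl -s 'https://cgspace.cgiar.org/sitemap'                    # sitemap index
$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=Identify'  # OAI-PMH
$ curl -s 'https://cgspace.cgiar.org/rest/items?limit=10'        # REST API
```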
- I added [GoAccess](https://goaccess.io/) to the list of packages to install in the DSpace role of the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public)
- It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:
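```
# Interactive terminal report of the nginx access log
# (default log path and the standard combined log format; adjust for the server in question)
$ goaccess /var/log/nginx/access.log --log-format=COMBINED
```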
- According to Uptime Robot CGSpace went down and up a few times
- I had a look in GoAccess and saw that CORE was actively indexing us
- Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)
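- A quick way to see where the connections are going is to group `pg_stat_activity` by database (run as the postgres user; the query is just a sketch):

```
# Count current PostgreSQL connections per database
$ sudo -u postgres psql -c 'SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;'
```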
- I'm really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
- Actually, come to think of it, they aren't even obeying `robots.txt`, because we explicitly disallow `/discover` and `/search-filter` URLs there, but they are hitting them massively:
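```
# Count requests to the disallowed Discovery paths from the two scraper IPs
# (default nginx log path; adjust as needed)
$ zcat -f /var/log/nginx/access.log* | grep -E '^(190\.19\.92\.5|104\.196\.152\.243) ' | grep -c -E '/(discover|search-filter)'
```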