Posted on: 06/12/21 02:12PM
Mderms said:
Yes, but while we're on the subject and there are a few threads active about this already, can someone explain to me like I'm 5 why this limitation is even a thing? I get that it's a hard-coded limit to stop the servers exploding, but why would the servers explode in the first place? Is it a hardware/software limitation, or is it a spaghetti code problem? Why don't similar websites have this same issue? Is it something to do with how you guys started developing the site versus how someone else started developing theirs, and at this point you're too far along to easily change the foundation of how the site works and brings up information?
Most websites search by directly querying their SQL database. That scales to essentially any depth, but it can be slow, since the database isn't optimized for any particular kind of search.
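
To make that concrete, here's a minimal sketch of the direct-query approach in Python. The posts table, its tags column, and SQLite itself are stand-ins for illustration, not our actual setup:

    import sqlite3

    # Hypothetical schema: each post's tags live in one text column.
    conn = sqlite3.connect("posts.db")

    def search_page(tag, page, per_page=20):
        # OFFSET pagination: the database scans matching rows and
        # discards everything before the requested page, and the LIKE
        # match touches the whole table, so cost tracks table size.
        offset = (page - 1) * per_page
        return conn.execute(
            "SELECT id, tags FROM posts "
            "WHERE tags LIKE ? "
            "ORDER BY id DESC LIMIT ? OFFSET ?",
            (f"%{tag}%", per_page, offset),
        ).fetchall()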
To dramatically speed up searches, we instead use Apache Solr, which is optimized specifically for searching text within documents. It's really intended for full-text searches across entire pages of text (imagine, for example, searching a whole database of academic papers for a given word or line of text), but since all of a post's tags are stored in a single text field in the database, it works quite well for our purposes too. Solr is very fast, but at the expense of additional server load, and that load scales up quickly with the depth into the results: to serve page N, Solr has to collect and rank every result on the pages before it, only to throw them away.
Directly searching the database, on the other hand, scales with the size of the entire table. That grows more slowly than Solr's depth scaling, but a larger table slows down all searches, regardless of depth, while Solr isn't slowed down nearly as much by large tables - it only uses lots of resources for deep searches. Given that the vast majority of searches are only a few pages deep at most, Solr is drastically faster on a table of 6 million posts than a direct database query - unless someone tries to go particularly deep into a search, which we've prevented.
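
For comparison, here's roughly what paginating through Solr looks like, using its standard /select endpoint. The core name and field name are made up for this example:

    import requests

    SOLR = "http://localhost:8983/solr/posts/select"  # hypothetical core

    def solr_page(query, page, per_page=20):
        # Solr pages with start/rows: to serve start=99980 it has to
        # collect and rank the top 100,000 hits, then discard all but
        # the last 20, so cost grows with depth, not table size.
        params = {
            "q": f"tags:{query}",  # assumed field name
            "start": (page - 1) * per_page,
            "rows": per_page,
            "sort": "id desc",
        }
        return requests.get(SOLR, params=params).json()["response"]["docs"]

    # Page 3 is cheap (start=40); page 5000 is not (start=99980).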
The "id:<" trick works because it cuts out all of the results from before that ID. Since Solr's resource usage increases with the depth into the
reasults, not the deptb into the database, this essentially cuts our all of the resource usage from before that point.
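
Under the hood, the metatag plausibly amounts to a Solr range filter plus a depth reset - the exact mapping in our backend isn't shown here, so treat this as an assumption (it reuses the SOLR URL from the sketch above):

    # Deep page: expensive - start=99980 makes Solr collect and rank
    # roughly 100,000 results before returning 20.
    params_deep = {"q": "tags:some_tag", "start": 99980, "rows": 20,
                   "sort": "id desc"}

    # The "id:<123456" metatag amounts to adding a range filter and
    # resetting the depth to zero - same 20 posts, nothing to skip.
    params_trick = {"q": "tags:some_tag",
                    "fq": "id:[* TO 123455]",  # ids strictly below 123456
                    "start": 0, "rows": 20,
                    "sort": "id desc"}

    # Either dict would be passed as params to the /select endpoint,
    # e.g. requests.get(SOLR, params=params_trick).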
This could theoretically be implemented automatically in our backend, but it isn't as simple as just grabbing the post ID at a certain depth - the search doesn't know that ID until it has already done the search. We could change our pagination to work based on post IDs instead of search depth, but then you'd only be able to move backward or forward one page at a time, because the system can't know the first and last IDs on other pages without searching to that depth first. It would also be a major change to both the backend and the frontend, essentially requiring the whole search system to be rebuilt from the ground up. Searches with page numbers above the limit could automatically be redirected to an "id:" search, but that would mean completing the search right up to the limit to get the ID and then completing the new search with the ID metatag - two searches for one page load. We could also fall back to a direct database search past a certain depth, but that has its own problems, and searches past that depth would suddenly become slow, even if they impact other users' experience less.
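
For what it's worth, the post-ID pagination described above is usually called keyset pagination. Here's a minimal sketch against the same hypothetical posts table, reusing conn from the first sketch:

    def next_page(tag, last_id, per_page=20):
        # Keyset pagination: anchor on the last ID the client saw
        # instead of using OFFSET. Cost no longer grows with depth,
        # but there's no way to jump straight to page 500 - the anchor
        # ID for a page isn't known until the previous page is fetched.
        return conn.execute(
            "SELECT id, tags FROM posts "
            "WHERE tags LIKE ? AND id < ? "
            "ORDER BY id DESC LIMIT ?",
            (f"%{tag}%", last_id, per_page),
        ).fetchall()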
We could put instructions on getting around the limit on the error page so people don't have to find the answer on the forums, similar to the fringe filter's instructions, but they'd still need to use the ID metatag trick.