Post by Scott Dorsey
Post by Richmond
Why can't web scrapers just pretend to be Lynx browsers?
Some do. That's why so many web servers refuse connections from Lynx.
IME it's more common for HTTP servers to react to "libwww" in
Lynx' User-Agent: than to "Lynx": removing the former (while
keeping "Lynx") has often enough resolved the issue for me.
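(For reference, the knob in question is Lynx' own -useragent=
option, e. g.:

    $ lynx -useragent="Lynx/2.9.0" http://example.com/

where the stock value would've looked something like
"Lynx/2.9.0 libwww-FM/2.14 SSL-MM/1.4.1"; the exact string
varies from build to build.)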
(These days, I mostly just switch to reading the site via
http://web.archive.org/ right away, though.)
Might be because "libwww" is both the name of the library Lynx
is based on, /and/ the name of an unrelated (AIUI) Perl library
that, I gather, used to be popular among web robot writers.
(See, e. g., http://packages.debian.org/sid/libwww-perl .)
A cursory look over my access.log files seems to hint that Go
is way more popular a choice for the task these days, though
my overall impression is that robot authors just use any of
the popular user agent strings for their software instead of
anything that might identify their actual codebase.
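(A crude way to count the former, assuming they keep the stock
Go net/http User-Agent: of "Go-http-client/<version>", would be
something like:

    $ grep -c Go-http-client access.log

which, of course, only catches the robots that don't bother to
lie about their origin.)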
Which means that making *any* big decisions based on User-Agent:
statistics (like, "Look, we're getting lots of hits from
Arachne users recently; let's optimize our site for their
best experience at once!") is ill-advised at best: you might
end up being trolled by a particularly creative botnet operator.
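Faking the header is, after all, a one-liner; a sketch in Go
(the URL and the User-Agent: value below are, naturally, made
up):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        req, err := http.NewRequest("GET", "http://example.com/", nil)
        if err != nil {
            panic(err)
        }
        // Whatever we put here is what the server's access.log
        // will record; nothing on the wire verifies it.
        req.Header.Set("User-Agent", "Arachne/1.97 (DOS)")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }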
Personally, as a web author, I try to a. stick to the standards;
b. have an actual reason for using one feature or another
(rather than going for "for consistency" or "just because" or
"this new shiny framework needs it") [*]; and c. mind my audience.
Sure, I use Lynx a lot for testing, so the webpages I author
tend to end up being compatible with Lynx, and might be less
compatible with other UAs. However, the idea that I should
adapt my practices to the idiosyncrasies of any particular
UA, regardless of its market share, rubs me the wrong way.
The "making sure the site works with IE" sort of wrong.
Conversely, as a reader of that same web, I expect to get a
standards-compliant document from the site. I deem it my own
responsibility to make use of it. For instance, I certainly
won't hold it against the site operator if /my/ software chokes
on something that /is/ standard.
What really irks me, though, is when in place of a document,
I get an application. (Doesn't even matter if it's .js, .exe,
or .tex.)
Not that I don't get disappointed on occasion when a website
"improves" its typography, or switches to a more "mobile-friendly"
look and feel. But that's one of the major reasons for me to
stick with Lynx in the first place: go and try to tweak the CSS
to make your website look more "modern" when viewed with Lynx!
[*] As a rule, my HTML is expected to comply with the requirements
of the Living Standard, for both text/html and application/xhtml+xml
Content-Type:s at the same time (the idea is that if .xhtml does
not work for someone, the file can be downloaded, renamed to
.html, and viewed that way.) My CSS should be /mostly/ 2.1
with some CSS3 Selectors (though I haven't quite checked it.)
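To illustrate, a minimal skeleton of the kind of polyglot
markup meant above (note the explicit xmlns= and the XML-style
/> on the void element):

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
    <meta charset="UTF-8"/>
    <title>An example</title>
    </head>
    <body>
    <p>The same bytes for both Content-Type:s.</p>
    </body>
    </html>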
When JavaScript is used (i. e., when I publish an application,
not just a document), it ought to conform to ECMA-262, 6th
edition (ES2015),
though the set of browser APIs used might vary depending on what
the application aims to do.