25 Jun 2019 • 2 min read • in Meta

Semantic web and rant

Written by Lionir

So, the semantic web, I came to it from being simply interesting in it because well, I thought it was good to make the computer understand my content and it also made HTML prettier in my eyes.. especially HTML5 semantic elements.

That's cool stuff and I learn along the way although, deeper semantics have left me deeply confused since their documentation is quite hard and there's multiple implementations so for now, I decided to leave it at that and use simple semantics for the most part.

Things like, the details element below, the figure element to put captions and picture elements together, the time element to make the time machine-readable making that data potentially useful for research, the nav element for navigation elements, all these kind of things which make parsing much easier than a bazillion div elements.

Well, that brought me to well, the most common use for machine-readable semantics – Search. By far, search is what needs the most semantics to make the results appropriate. That lead me to changing my robots.txt to allow crawling of certain parts which was fairly straight forward.

Then I thought sitemap could be nice to allow it more easily and then since I had no feedback on those changes, I took my helmet and braced Google's search console which is.. painful, there's delays everywhere, you need to verify you own the website, you need to make sure your results look fine which will require even more waiting... then yesterday, I realised something terrible.

Certain pages which I didn't allow to be crawled were indexed, there was no way to remove it from any interface nor a hint to tell me so then I talked to a friend about it and he linked me something that made me very mad.

Keep in mind that these settings can be read and followed only if crawlers are allowed to access the pages that include these settings. The <meta name="robots" content="noindex" /> tag or directive applies to search engine crawlers. To block non-search crawlers, such as AdsBot-Google, you might need to add directives targeted to the specific crawler (for example, <meta name="AdsBot-Google" content="noindex" />). -Google

And this non-standard is really annoying. Not only does Google not see that not Crawling should equate in non-indexing but their solution to make the difference is a standard that they don't even follow all the time, they say it only applies to search-engine bots.

Why is that a problem? Indexing and crawling is one of the most basic flows of requests so when you can't have a simple opt-out from all of them, you'll end up with unwanted traffic and/or a million meta tags to control it and that just makes me terribly mad.

If anyone at W3C or at Google reads this, please standardize a sane system for indexing and crawling. Robots.txt is not standardized so that can be painful to deal with and the meta robots tag doesn't have a simple catch-all nor is it standardized either.