Google has clarified how Googlebot crawls and indexes pages, saying it will automatically index only the first 15MB of a page.
Google is the dominant search engine by a wide margin, and many websites live or die by their rankings in it. As a result, understanding exactly how it works is the goal of many a webmaster. Google has shed a bit more light on the topic, outlining how Googlebot crawls and indexes pages:
Googlebot can crawl the first 15MB of content in an HTML file or supported text-based file. After the first 15MB of the file, Googlebot stops crawling and only considers the first 15MB of content for indexing.
The tidbit is important, as it gives webmasters a concrete target to shoot for. The 15MB limit should also encourage the design of lean, performant websites that work as well on a smartphone as on a desktop with a high-speed connection.
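For a rough self-check, a minimal Python sketch like the one below (the example.com URL is a placeholder) fetches a page and compares the size of its raw HTML to the 15MB cutoff. Keep in mind that the limit applies to the HTML file itself; images, CSS, and JavaScript referenced by the page are fetched separately.

```python
import urllib.request

FIFTEEN_MB = 15 * 1024 * 1024  # Googlebot's stated crawl/index cutoff per HTML file

def check_html_size(url: str) -> None:
    """Fetch a page and report how its raw HTML size compares to the 15MB limit."""
    with urllib.request.urlopen(url) as response:
        html = response.read()
    print(f"{url}: {len(html) / 1024:.1f} KB of HTML")
    if len(html) > FIFTEEN_MB:
        print("Warning: anything past the first 15MB may not be indexed.")

check_html_size("https://example.com/")  # placeholder URL
```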
Google announced that in April it will update the smartphone user-agent of Googlebot. The company says the update will help its renderer better understand pages that use newer web technologies.
(Google’s announcement included the old and new smartphone user-agent strings for comparison.)
“Our renderer evolves over time and the user-agent string indicates that it is becoming more similar to Chrome than Safari,” says Google software engineer Katsuaki Ikegami. “To make sure your site can be viewed properly by a wide range of users and browsers, we recommend using feature detection and progressive enhancement.”
“Our evaluation suggests that this user-agent change should have no effect on 99% of sites,” Ikegami adds. “The most common reason a site might be affected is if it specifically looks for a particular Googlebot user-agent string. User-agent sniffing for Googlebot is not recommended and is considered to be a form of cloaking. Googlebot should be treated like any other browser.”
Google suggests checking your site with its Fetch and Render tool (in Search Console) if you think your site might be affected.
Google just introduced a new Webmaster Tools feature called the Blocked Resources Report, aimed at helping webmasters find and resolve issues where Google can’t use images, CSS, or JavaScript that have been blocked. Blocked resources prevent pages from rendering properly, and Google wants to make sure you’re only blocking what you really want or need to block.
The report provides the names of the hosts from which your site is using blocked resources. If you click on a row, it gives you the list of blocked resources and the pages that embed them. This should help you figure out the issues and take care of them so Google can better crawl and index your content.
Some resources will be hosted on your own site, while others will be hosted elsewhere. Clicking on a host will also give you a count of pages on your site affected by each blocked resource. Clicking on any blocked resource will give you a list of pages that load that resource. If you click on any page in the table hosting a blocked resource, you’ll get instructions for unblocking that particular resource.
In a help center article, Google runs down five steps for evaluating and reducing your list of blocked resources:
1. Open the Blocked Resources Report to find a list of hosts of blocked resources on your site. Start with the hosts that you own, since you can directly update the robots.txt files, if needed.
2. Click a host on the report to see a list of blocked resources from that host. Go through the list and start with those that might affect the layout in a meaningful way. Less important resources, such as tracking pixels or counters, aren’t worth bothering with.
3. For each resource that affects layout, click to see a list of your pages that use it. Click on any page in the list and follow the pop-up instructions for viewing the difference and updating the blocking robots.txt file. Fetch and render after each change to verify that the resource is now appearing (a quick programmatic check is sketched after this list).
4. Continue updating resources for a host until you’ve enabled Googlebot access to all the important blocked resources.
5. Move on to hosts that you don’t own, and if the resources have a strong visual impact, either contact the webmaster of those sites to ask them to consider unblocking the resource to Googlebot, or consider removing your page’s dependency on that resource.
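Before editing anything, it can help to confirm exactly which resources your robots.txt blocks for Googlebot. Here’s a minimal sketch using Python’s standard-library robots.txt parser; the example.com URLs are placeholders for your own robots.txt and the resources flagged in the report.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")  # placeholder
parser.read()

# CSS/JS files your pages load; substitute the resources flagged in the report.
resources = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
]

for url in resources:
    status = "allowed" if parser.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{status}: {url}")
```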
There’s also an update to Fetch and Render, which shows how the blocked resources matter. When you request a URL to be fetched and rendered, it shows screenshots rendered both as Googlebot and as a typical user, so you get a better grasp on the problems.
“Webmaster Tools attempts to show you only the hosts that you might have influence over, so at the moment, we won’t show hosts that are used by many different sites (such as popular analytics services),” says Google webmaster trends analyst John Mueller. “Because it can be time-consuming (usually not for technical reasons!) to update all robots.txt files, we recommend starting with the resources that make the most important visual difference when blocked.”
In January, Google called on webmasters to offer suggestions for new features for Webmaster Tools. It set up a Google Moderator page where people could leave and vote on suggestions. Among the most popular suggestions were:
“I would like to see in WMT data from 12 months, not 3 as it is now :)”
“An automated action viewer, so webmasters can see if they were impacted by an algorithm such as Panda or Penguin.”
“Bounce back measuring tool. Did the user go back to Google for a similar search or did they find what they needed?”
Google has since given webmasters a new structured data tool.
Google announced that it’s introducing new locale-aware crawl configurations for Googlebot for pages it detects may adapt their content based on the request’s language and perceived location.
“Locale-adaptive pages change their content to reflect the user’s language or perceived geographic location,” Google says in a blog post. “Since, by default, Googlebot requests pages without setting an Accept-Language HTTP request header and uses IP addresses that appear to be located in the USA, not all content variants of locale-adaptive pages may be indexed completely.”
The new configurations are geo-distributed crawling and language-dependent crawling. The former sees Googlebot starting to use IP addresses that appear to come from outside of the U.S. as well as the current IP addresses that appear to be from the U.S. that Googlebot already uses. The latter is where Googlebot crawls with an Accept-Language HTTP header in the request.
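To make the language-dependent case concrete, here is a hedged sketch of what a locale-adaptive server might do with the Accept-Language header. The supported-locale list is hypothetical, and for simplicity the sketch ignores q-value ordering; a request without the header, which is what Googlebot sent by default before this change, falls back to the default locale.

```python
SUPPORTED = ("en", "de", "fr")  # hypothetical locales this site can serve
DEFAULT = "en"

def pick_locale(accept_language: str) -> str:
    """Return the first supported primary language found in an Accept-Language header."""
    for part in accept_language.split(","):
        primary = part.split(";")[0].strip().lower().split("-")[0]
        if primary in SUPPORTED:
            return primary
    return DEFAULT

print(pick_locale("de-CH,de;q=0.9,en;q=0.8"))  # -> "de"
print(pick_locale(""))                         # -> "en" (no header: Googlebot's old default)
```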
The new configurations are enabled automatically for pages Google detects to be locale-adaptive, and the company warns you may notice changes in how it crawls and shows your site in search results even if you haven’t changed your CMS or server settings.
Google supports and recommends using separate URLs for each locale and annotating them with rel=”alternate” hreflang annotations. It considers separate URLs the best way for users to interact with and share your content, and the best way to maximize the indexing and ranking of your content.
Google has a new “Webmaster Help” video out about e-commerce pages with multiple breadcrumb trails. This is the second video in a row to deal specifically with e-commerce sites. Last time, Matt Cutts discussed product pages for products that are no longer available.
This time, he takes on the following question:
Many of my items belong to multiple categories on my eCommerce site. Can I place multiple breadcrumbs on a page? Do they confuse Googlebot? Do you properly understand the logical structure of my site?
“It turns out, if you do breadcrumbs, we will currently pick the first one,” he says. “I would try to get things in the right category or hierarchy as much as you can, but that said, if an item does belong to multiple areas within your hierarchy it is possible to go ahead and have multiple breadcrumbs on a page, and in fact that can, in some circumstances, actually help Googlebot understand a little bit more about the site.”
“But don’t worry about it if it only fits in one, or if you’ve only got breadcrumbs for one,” Cutts continues. “That’s the way that most people do it. That’s the normal way to do it. We encourage that, but if you do have the taxonomy (the category, the hierarchy), you know, and it’s already there, and it’s not like twenty different spots within your categories…if it’s in a few spots, you know, two or three or four…something like that, it doesn’t hurt to have those other breadcrumbs on the page. And we’ll take the first one. That’s our current behavior, and then we might be able to do a little bit of deeper understanding over time about the overall structure of your site.”
For more about how Google treats breadcrumbs, you might want to take a look at this page in Google’s webmaster help center. In fact, it even gives an example of a page having more than one breadcrumb trail (Books > Authors > Stephen King and Books > Fiction > Horror).
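One way to express more than one trail in a machine-readable way is schema.org BreadcrumbList markup. As a hedged illustration, here is a Python sketch that emits JSON-LD for the two example trails above; the URLs are placeholders, and this reflects current structured-data practice rather than anything stated in the video.

```python
import json

def breadcrumb_list(trail):
    """Build a schema.org BreadcrumbList for one trail of (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": i, "name": name, "item": url}
            for i, (name, url) in enumerate(trail, start=1)
        ],
    }

# Two trails for the same page, mirroring the help-center example (placeholder URLs).
trails = [
    [("Books", "https://example.com/books"),
     ("Authors", "https://example.com/books/authors"),
     ("Stephen King", "https://example.com/books/authors/stephenking")],
    [("Books", "https://example.com/books"),
     ("Fiction", "https://example.com/books/fiction"),
     ("Horror", "https://example.com/books/fiction/horror")],
]

print(json.dumps([breadcrumb_list(t) for t in trails], indent=2))
```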
About a year ago, Google put out a Webmaster Help video discussing PageRank as it relates to 301 redirects. Specifically, someone asked, “Roughly what percentage of PageRank is lost through a 301 redirect?”
Google’s Matt Cutts responded, noting that it can change over time, but that it had been “roughly the same” for quite a while.
“The amount of PageRank that dissipates through a 301 is currently identical to the amount of PageRank that dissipates through a link,” he explained. “So they are utterly the same in terms of the amount of PageRank that dissipates going through a 301 versus through a link. So that doesn’t mean use a 301. It doesn’t mean use a link. It means use whatever is best for your purposes because you don’t get to hoard or conserve any more PageRank if you use a 301, and likewise it doesn’t hurt you if you use a 301.”
In a new Webmaster Central office hours video (via Search Engine Roundtable), Google’s John Mueller dropped another helpful tidbit about redirects: Googlebot will follow up to five of them in a chain at the same time.
“We generally prefer to have fewer redirects in a chain if possible. I think Googlebot follows up to five redirects at the same time when it’s trying to crawl a page, so up to five it would do within the same cycle. If you have more than five in a chain, then we would have to kind of think about that the next time we crawled that page, and follow the rest of the redirects…We generally recommend trying to reduce it to one redirect wherever possible. Sometimes there are technical reasons why that’s not possible, so something with two redirects is fine.”
As Barry Schwartz at SER notes, this may be the first time Google has given a specific number. In the comments of his post, Michael Martinez says it used to be 2.
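If you want to see how long a redirect chain actually is, a quick sketch using the third-party requests library (assumed to be installed; the URL is a placeholder) will list every hop:

```python
import requests  # third-party; pip install requests

def redirect_chain(url: str) -> list[str]:
    """Follow redirects and return every URL visited, including the final one."""
    response = requests.get(url, allow_redirects=True, timeout=10)
    return [hop.url for hop in response.history] + [response.url]

hops = redirect_chain("http://example.com/old-page")  # placeholder URL
print(" -> ".join(hops))
if len(hops) - 1 > 5:
    print("More than five redirects: per Mueller, Googlebot may defer the rest "
          "to a later crawl cycle.")
```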
Google has released a new Webmaster Help video in response to a question from a user who has been having trouble getting Google to fetch their robots.txt file. Here’s what the user said:
“I’m getting errors from Google Webmaster Tools about the Googlebot crawler being unable to fetch my robots.txt 50% of the time (but I can fetch it with 100% success rate from various other hosts). (On a plain old nginx server and an mit.edu host.)”
Google’s Matt Cutts begins by indicating that he’s not saying this is the case here, but…
“Some people try to cloak, and they end up making a mistake, and they end up reverse-cloaking. So when a regular browser visits, they serve the content, and when Google comes and visits, they will serve empty or completely zero length content. So every so often, we see that – where in trying to cloak, people actually make a mistake and shoot themselves in the foot, and don’t show any content at all to Google.”
“But, one thing that you might not know, and most people don’t know (we just confirmed it ourselves), is you can use the free fetch as googlebot feature in Google Webmaster Tools on robots.txt,” he adds. “So, if you’re having failures 50% of the time, then give that a try, and see whether you can fetch it. Maybe you’re load balancing between two servers, and one server has some strange configuration, for example.”
Something to think about if this is happening to you (and hopefully you’re not really trying to cloak). More on Fetch as Google here.
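If you are seeing a similar intermittent failure pattern, a simple probe like this sketch (standard library only; the URL is a placeholder) can help reveal whether one server behind a load balancer is misconfigured:

```python
import urllib.request

def probe_robots(url: str, attempts: int = 20) -> None:
    """Fetch robots.txt repeatedly and report how often it succeeds."""
    ok = 0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200 and resp.read():
                    ok += 1
        except OSError:
            pass  # network or HTTP error counts as a failure
    print(f"{ok}/{attempts} successful fetches of {url}")
    if ok < attempts:
        print("Intermittent failures: check each server behind the load balancer.")

probe_robots("https://example.com/robots.txt")  # placeholder URL
```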
Google has put out a new webmaster help video with Matt Cutts. The basis of the user-submitted question isn’t even accurate, and Cutts still took the time to make the video and answer it. This goes to show that there is a solid chance that Google will answer your questions when you send them.
The question was:
The Webmaster Tools “Fetch as Googlebot” feature does not allow one to fetch an https page, making it not very useful for secure sites – any plans to change that?
“So, we just tried it here, and it works for us,” said Cutts. “You have to register, and prove that you own the https site, just like you do with an http site. Once you’ve proven that you control or verify that you are able to control that https page, you absolutely can fetch. You need to include the protocol when you’re doing the fetch, but it should work just fine. If it doesn’t work, maybe show up in the webmaster forum, and give us some feedback, but we just tried it on our side, and it looks like it’s working for us.”
How’s that for customer support?
They must be getting close to the end of this batch of videos.
Google may be getting better at crawling JavaScript and Ajax.
In a Tumblr post, developer Alex Pankratov wrote this week about spotting an “ajax request issued from document.ready() callback of one website’s pages.”
“This means that the bot now executes the Javascript on the pages it crawls,” Pankratov wrote. “The IP of 66.249.67.106 is crawl-66-249-67-106.googlebot.com and the A record is a match, so this is in fact a Google Bot.”
He then shows a line, which he says “is fetched via Ajax by a Javascript function in response to the menu item click,” and adds, “Also, note the x argument – it is dynamically added and only by that specific function. This means that the bot now emulates a user clicking around the site and then seeing which actionable items lead to which additional pages.”
Sean Gallagher at Ars Technica equates this to Googlebot learning to read interactive pages more like humans. “It appears Google’s bots have been trained to act more like humans to mine interactive site content, running the JavaScript on pages they crawl to see what gets coughed up,” he writes.
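The reverse-and-forward DNS check Pankratov used is also the standard way to verify a claimed Googlebot hit in your own logs: resolve the IP to a hostname, confirm it ends in googlebot.com or google.com, then resolve that hostname back and check that it returns the same IP. A small sketch using Python’s standard library:

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot IP via reverse DNS plus a forward-confirming lookup."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse (PTR) lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(host)  # forward (A) lookup
    except OSError:
        return False
    return ip in addresses                               # the "A record is a match" check

print(is_real_googlebot("66.249.67.106"))  # the address from Pankratov's logs
```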
Google has indicated that it is getting better at handling JavaScript and AJAX. Here’s a video Google’s Matt Cutts put out about how Google handles AJAX a while back:
Cutts was asked, “How effective is Google now at handling content supplied via Ajax, is this likely to improve in the future?”
He responded, “Well, let me take Ajax, which is Asynchronous Javascript, and make it just Javascript for the time being. Google is getting more effective over time, so we actually have the ability not just to scan in strings of Javascript to look for URLs, but to actually process some of the Javascript. And so that can help us improve our crawl coverage quite a bit, especially if people use Javascript to help with navigation or drop-downs or those kinds of things. So Asynchronous Javascript is a little bit more complicated, and that’s maybe further down the road, but the common case is Javascript.”
“And we’re getting better, and we’re continuing to improve how well we’re able to process Javascript,” he continues. “In fact, let me just take a little bit of time and mention, if you block Javascript or CSS in your robots.txt, where Googlebot can’t crawl it, I would change that. I would recommend making it so that Googlebot can crawl the Javascript and can crawl the CSS, because that makes it a lot easier for us to figure out what’s going on if we’re processing the Javascript or if we’re seeing and able to process and get a better idea of what the page is like.”
Update: Barry Schwartz at Search Engine Roundtable says, “Google has been doing this for a while. Back in 2009 GoogleBot was executing JavaScript and in November 2011 Google began doing so with AJAX.”
Google uploaded a new Webmaster Help video from Matt Cutts, which addresses a question about the hardware/server-side software that powers a typical Googlebot server.
“So one of the secrets of Google is that rather than employing these mainframe machines, this heavy iron, big iron kind of stuff, if you were to go into a Google data center and look at an example rack, it would look a lot like a PC,” says Cutts. “So there’s commodity PC parts. It’s the sort of thing where you’d recognize a lot of the stuff from having opened up your own computer, and what’s interesting is rather than have like special Googlebot web crawling servers, we tend to say, OK, build a whole bunch of different servers that can be used interchangeably for things like Googlebot, or web serving, or indexing. And then we have this fleet, this armada of machines, and you can deploy it on different types of tasks and different types of processing.”
“So hardware wise, they’re not exactly the same, but they look a lot like regular commodity PCs,” he adds. “And there’s no difference between Googlebot servers versus regular servers at Google. You might have differences in RAM or hard disk, but in general, it’s the same sorts of stuff.”
On the software side, Google of course builds everything itself, so as not to have to rely on third parties. Cutts says there’s a running joke at Google along the lines of “we don’t just build the cars ourselves, and we don’t just build the tires ourselves. We actually vulcanize the rubber on the tires ourselves.”
“We tend to look at everything all the way down to the metal,” Cutts explains. “I mean, if you think about it, there’s data center efficiency. There’s power efficiency on the motherboards. And so if you can sort of keep an eye on everything all the way down, you can make your stuff a lot more efficient, a lot more powerful. You’re not wasting things because you use some outside vendor and it’s a black box.”
“In the same way that you might examine your electricity bill and then tweak the thermostat, we constantly track our energy consumption and use that data to make improvements to our infrastructure. As a result, our data centers use 50 percent less energy than the typical data center,” wrote Joe Kava, Senior Director, data center construction and operations at Google.
Cutts says Google uses a lot of Linux-based machines and Linux-based servers.
“We’ve got a lot of Linux kernel hackers,” he says. “And we tend to have software that we’ve built pretty much from the ground up to do all the different specialized tasks. So even to the point of our web servers. We don’t use Apache. We don’t use IIS. We use something called GWS, which stands for the Google Web Server.”
“So by having our own binaries that we’ve built from our own stuff and building that stack all the way up, it really unlocks a lot of efficiency,” he adds. “It makes sure that there’s nothing that you can’t go in and tweak to get performance gains or to fix if you find bugs.”
If you’re interested in how Google really works, you should watch this video too:
Google announced some improvements for how it indexes smartphone content in mobile search. Googlebot-Mobile will now crawl with a smartphone user-agent in addition to its previous feature phone user-agents.
This, the company says, will allow it to increase its coverage of smartphone content, which means a better search experience for smartphone users.
“The content crawled by smartphone Googlebot-Mobile will be used primarily to improve the user experience on mobile search,” explains software engineer Yoshikiyo Kato. “For example, the new crawler may discover content specifically optimized to be browsed on smartphones as well as smartphone-specific redirects.”
“One new feature we’re also launching that uses these signals is Skip Redirect for Smartphone-Optimized Pages,” adds Kato. “When we discover a URL in our search results that redirects smartphone users to another URL serving smartphone-optimized content, we change the link target shown in the search results to point directly to the final destination URL. This removes the extra latency the redirect introduces leading to a saving of 0.5-1 seconds on average when visiting landing page for such search results.”
Google stresses that webmasters should treat each Googlebot-Mobile request as they would a human user with the same phone user-agent. Google points to a blog post from the Webmaster Central blog earlier this year about making sites more mobile-friendly.
In that, Webmaster Trends Analyst Pierre Far wrote, “To decide which content to serve, assess which content your website has that best serves the phone(s) in the User-agent string.”
The new smartphone user-agent strings are as follows:
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
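In practice, that advice amounts to keying any device detection off the phone portion of the User-Agent (the iPhone details in the string above) rather than the Googlebot-Mobile token itself, so the crawler sees what a real user on that phone would see. A hedged sketch; the hint list is illustrative, not exhaustive:

```python
SMARTPHONE_HINTS = ("iPhone", "Android", "Windows Phone")  # illustrative device tokens

def variant_for(user_agent: str) -> str:
    """Return 'smartphone' or 'desktop' based on device hints in the UA string."""
    if any(hint in user_agent for hint in SMARTPHONE_HINTS):
        return "smartphone"
    return "desktop"

googlebot_mobile_ua = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) "
    "AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 "
    "Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
)
print(variant_for(googlebot_mobile_ua))  # -> "smartphone", same as a real iPhone user
```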
Is there some new content and/or a new site you want Google to notice sooner rather than later? Well, there’s an official Google utility for that.
With the Fetch as Googlebot URL submission tool, site owners can now directly request that Google send the web-crawling/indexing Googlebot to the URL that was submitted. While Google will usually find new pages and sites on its own, especially if they are backlinked, this method speeds up the process. In fact, according to the Google Webmaster Central Blog, URLs that have been submitted with Fetch as Googlebot are crawled within a day.
There are obvious benefits to this technique, especially if Google’s index has out-of-date pages for your site and you’d like to see the content updated. Clearly, the same is true for new site launches as well. The sooner a site is in the Google index, the better. Quality backlinks and good content are still the key to gaining rank in the index, but knowing your site will be indexed almost as soon as it’s launched is a boon for site owners and SEOs alike.
The blog post details the steps in order to submit a URL to Fetch, and it’s really quite simple. So much so, in fact, it would be foolish not to take advantage of the option.
How to submit a URL
First, use Diagnostics > Fetch As Googlebot to fetch the URL you want to submit to Google. If the URL is successfully fetched, you’ll see a new “Submit to index” link appear next to the fetched URL.
Once you click “Submit to index” you’ll see a dialog box that allows you to choose whether you want to submit only the one URL, or that URL and all its linked pages.
When submitting individual URLs, we have a maximum limit of 50 submissions per week; when submitting URLs with all linked pages, the limit is 10 submissions per month.
In other words, you can submit 50 individual URLs per week, or 10 “URL and all linked pages” submissions per month. The post goes on to say that if you want to submit content like images and/or video, you should use a Sitemap. Fetch as Googlebot is intended for content that appears in the web search results, i.e., text-based pages.
Another update allows users to submit unverified URLs to the Googlebot as well. The difference is that with verified submissions, the person submitting must confirm ownership of the site/URL being submitted. Unverified submissions, obviously, do not require the same proof. There’s even a link provided for these kinds of unverified submissions, which takes you to the Crawl URL page.
If you’re a committed site owner and you’re not taking advantage of these capabilities, you are only cheating yourself and your business.
The video that leads this post features Google’s Matt Cutts discussing how long it takes Googlebot to recrawl a page; it was posted in May 2010 on the Google Webmaster Help YouTube page. While not specific, the answer for sites that frequently update content was “a few days.” Now, with Fetch as Googlebot, if it’s that important that your new content be indexed even faster, well, there’s a utility for that.
Sometimes webmasters set up a spider trap or crawler trap to catch spambots or other crawlers that waste their bandwidth. If some webmasters are right, Googlebot (Google’s crawler) seems to be having some issues here.
In the WebmasterWorld forum, member Starchild started a thread by saying, “I saw today that Googlebot got caught in a spider trap that it shouldn’t have as that dir is blocked via robots.txt. I know of at least one other person recently who this has also happened to. Why is GB ignoring robots?”
Another member suggested that Starchild was mistaken, as such claims have been made in the past, only to find that there were other issues at play.
Starchild responded, however, that it had been in place for “many months” with no changes. “Then I got a notification it was blocked (via the spidertrap notifier). Sure enough, it was. Upon double checking, Google webmaster tools reported a 403 forbidden error. IP was google. I whitelisted it, and Google webmaster tools then gave a success.”
Another member, nippi, said they also got hit by it four months after setting up a spider trap, which had been “working fine” until now.
“The link to the spider trap is rel=Nofollowed, the folder is banned in robot.txt. The spider trap works by banning by ip address, not user agent so its not caused by a faker – and of course robots.txt was setup up correctly and prior, it was in place days before the spider trap was turned on, and it’s run with no problems for months,” nippi added. “My logs show, it was the real google, from a real google ip address that ignored my robots.txt, ignored rel-nofollow and basically killed my site.”
We’ve reached out to Google for comment, and will update if and when we receive a response.
Meanwhile, Barry Schwartz is reporting that one site lost 60% of its traffic instantly, due to a bug in Google’s algorithm. He points to a Google Webmaster Help forum thread where Google’s Pierre Far said:
I reached out to a team internally and they identified an algorithm that is inadvertently negatively impacting your site and causing the traffic drop. They’re working on a fix which hopefully will be deployed soon.
Google’s Kaspar Szymanski commented on Schwartz’s post, “While we can not guarantee crawling, indexing or ranking of sites, I believe this case shows once again that our Google Help Forum is a great communication channel for webmasters.”