WebProNews

Tag: Robots.txt

  • Google Webmaster Tools Gets Updated Robots.txt Testing Tool

    Google has released an updated robots.txt testing tool in Webmaster Tools. The tool can be found in the Crawl section.

    The new version of the tool aims to make it easier to create and maintain a “correct” robots.txt file, and to find the directives within a large file that are or were blocking individual URLs.

    “Here you’ll see the current robots.txt file, and can test new URLs to see whether they’re disallowed for crawling,” says Google’s Asaph Amon, describing the tool. “To guide your way through complicated directives, it will highlight the specific one that led to the final decision. You can make changes in the file and test those too, you’ll just need to upload the new version of the file to your server afterwards to make the changes take effect. Our developers site has more about robots.txt directives and how the files are processed.”

    “Additionally, you’ll be able to review older versions of your robots.txt file, and see when access issues block us from crawling,” Amon explains. “For example, if Googlebot sees a 500 server error for the robots.txt file, we’ll generally pause further crawling of the website.”
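
    If you want a quick local approximation of that URL check outside of Webmaster Tools, Python’s standard-library robots.txt parser can do a rough version of it. Treat the snippet below as a sketch only: the rules and URLs are hypothetical, and Google’s own parser supports wildcards and longest-match precedence that urllib.robotparser does not.

    ```python
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt rules for an example site.
    robots_lines = [
        "User-agent: *",
        "Disallow: /private/",
        "Disallow: /tmp/",
    ]

    parser = RobotFileParser()
    parser.parse(robots_lines)

    # Test individual URLs, the way the Webmaster Tools tester does.
    for url in ("http://example.com/private/report.html",
                "http://example.com/index.html"):
        allowed = parser.can_fetch("Googlebot", url)
        print(url, "->", "allowed" if allowed else "disallowed")
    ```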

    Google recommends double-checking the robots.txt files for your existing sites for errors or warnings. It also suggests using the tool alongside the recently updated Fetch as Google tool to render important pages, or using it to track down the directive that’s blocking any URLs reported as blocked.

    Google says it often sees files that block CSS, JavaScript, or mobile content, which can keep Googlebot from rendering pages properly. You can use the tool to track down and fix such blocks on your own site.

    Google also added a new rel=alternate-hreflang feature to Webmaster Tools. More on that here.

    Image via Google

  • Google Removing Subscriber Stats Feature From Webmaster Tools

    Just as Google announced changes to API deprecation, with some APIs being retired outright, the company is now looking at which Webmaster Tools features it can phase out. It decides these features’ fate by seeing “if they’re still useful in comparison to the maintenance and support they require.”

    The first tool to get the boot is the Subscriber stats feature, which reported “the number of subscribers to a site’s RSS or Atom feeds.” Google already offers the same functionality in its Feedburner tool, so it suggests that users of the Subscriber stats feature switch to that instead.

    The second removal is the Create robots.txt tool, which let webmasters generate a robots.txt file blocking sections of a site from being crawled by Googlebot. It’s being removed because it saw very little use. Google says those who did use the feature can easily create their own file, since a multitude of other services generate robots.txt files.
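
    Hand-rolling an equivalent file takes only a few lines. The sketch below is purely illustrative; the Googlebot-only “/private/” rule is a made-up example of the kind of section-blocking the retired generator produced.

    ```python
    # Build a robots.txt that blocks one hypothetical section from Googlebot
    # while leaving every other crawler unrestricted.
    rules = {
        "Googlebot": ["/private/"],  # keep Googlebot out of /private/
        "*": [],                     # no restrictions for anyone else
    }

    lines = []
    for agent, disallowed in rules.items():
        lines.append(f"User-agent: {agent}")
        # An empty Disallow line means "allow everything" for that agent.
        lines.extend(f"Disallow: {path}" for path in (disallowed or [""]))
        lines.append("")  # blank line between groups

    with open("robots.txt", "w") as f:
        f.write("\n".join(lines))

    print("\n".join(lines))
    ```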

    The last feature headed for the chopping block is the Site performance feature that’s part of Webmaster Tools Labs. It let webmasters check the average load time of a site’s pages. The reason for its removal is the same as the last: low usage. If you need to check your site’s performance, Google provides the same functionality in the Site Speed feature in Google Analytics and in its PageSpeed tool.

    As you can see, these removals are more about removing redundancy than any kind of breaking changes. The retiring of these features only means that Webmasters have to switch to one of the many other options available. Chances are you’re already using one of those alternative options. If you are still using one of the above features, you have two weeks to say your goodbyes.

  • Google: Not Having Robots.txt is “A Little Bit Risky”

    Robots.txt, as you may know, lets Googlebot know whether you want it to crawl your site or not.

    Google’s Matt Cutts spoke about a few options for these files in the latest Webmaster Help video, in response to a user-submitted question: “Is it better to have a blank robots.txt file, a robots.txt that contains ‘User-agent: * Disallow:’, or no robots.txt file at all?”

    “I would say any of the first two,” Cutts responded. “Not having a robots.txt file is a little bit risky – not very risky at all, but a little bit risky because sometimes when you don’t have a file, your web host will fill in the 404 page, and that could have various weird behaviors. Luckily we are able to detect that really, really well, so even that is only like a 1% kind of risk.”

    “But if possible, I would have a robots.txt file whether it’s blank or you say User-agent: * Disallow nothing, which means everybody’s able to crawl anything they want is pretty equal,” said Cutts. “We’ll treat those syntactically as being exactly the same. For me, I’m a little more comfortable with User-agent: * and then Disallow: just so you’re being very specific that ‘yes, you’re allowed to crawl everything’. If it’s blank then yes, people were smart enough to make the robots.txt file, but it would be great to have just like that indicator that says exactly, ‘ok, here’s what the behavior is that’s spelled out.’ Otherwise, it could be like maybe somebody deleted everything in the file by accident.”

    “If you don’t have one at all, there’s just that little tiny bit of risk that your web host might do something strange or unusual like return a ‘you don’t have permission to read this’ file, which you know, things get a little strange at that point,” Cutts reiterated.
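
    Cutts’ point that the blank file and the explicit file are treated the same is easy to sanity-check with any robots.txt parser. Here’s a minimal sketch using Python’s standard library (the URL is a placeholder, and this only approximates how Google parses the file):

    ```python
    from urllib.robotparser import RobotFileParser

    def can_crawl(robots_lines, url, agent="Googlebot"):
        parser = RobotFileParser()
        parser.parse(robots_lines)
        return parser.can_fetch(agent, url)

    url = "http://example.com/any/page.html"

    blank_file = []                                 # empty robots.txt
    explicit_file = ["User-agent: *", "Disallow:"]  # "allow everything"

    print("blank robots.txt ->", can_crawl(blank_file, url))     # True
    print("explicit file    ->", can_crawl(explicit_file, url))  # True
    ```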

    All of this, of course, assumes that you want Google to crawl your site.

    In another video from Cutts we looked at yesterday, he noted that Google will sometimes use DMOZ to fill in snippets in search results when they can’t otherwise see the page’s content because it was blocked by robots.txt. He noted that Google is currently looking at whether or not it wants to continue doing this.

  • Perfect 10 Fails Where Google Succeeds

    Perfect10.com, a site that features incredibly attractive female models in various positions of nude repose, has long been after Google because of the site’s content appearing in Google Image Search. Their struggle has been going on for some time now.

    In fact, WebProNews has articles dating back to 2005 discussing this very subject. However, according to the latest appeal loss, the saga may finally be coming to an end. According to a post over at CNet, the latest attempt by Perfect 10, one that seeks to punish Google for being a search engine that works as it’s supposed to, has been denied.

    Here’s the gist:

    The Ninth Circuit rejected a request by Perfect 10, a porn studio with a long history of filing copyright suits against Internet companies, for a preliminary injunction against Google. The court said that Perfect 10 didn’t present enough evidence to prove that it would suffer irreparable harm from the photos.

    You see, Perfect 10’s content is primarily hidden behind a paywall, meaning that in order to see its index of naked women, you have to pay for it. Unfortunately for the Perfect 10 web developer, who apparently didn’t understand how to manipulate a robots.txt file, Perfect 10 images began appearing in Google’s image search results.

    Despite the unending number of tutorials and instructional sites that show developers how to keep their paid content from appearing in free image searches, for some reason Perfect 10 felt it was Google’s fault its paid content was going out to the world for free.
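
    For what it’s worth, the fix those tutorials describe comes down to a couple of robots.txt lines aimed at Google’s image crawler. Here’s a minimal sketch with hypothetical paths, using Python’s standard-library parser to confirm the effect:

    ```python
    from urllib.robotparser import RobotFileParser

    # Hypothetical rules: keep Googlebot-Image out of a paid-images directory
    # while leaving the rest of the site crawlable.
    robots_lines = [
        "User-agent: Googlebot-Image",
        "Disallow: /members/images/",
        "",
        "User-agent: *",
        "Disallow:",
    ]

    parser = RobotFileParser()
    parser.parse(robots_lines)

    image_url = "http://example.com/members/images/photo-001.jpg"
    print("Googlebot-Image:", parser.can_fetch("Googlebot-Image", image_url))  # False
    print("Googlebot:      ", parser.can_fetch("Googlebot", image_url))        # True
    ```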

    In fact, Perfect 10’s claim was that Google’s image search cost it somewhere in the area of $50 million. Disregarding the fact that, again, the blame should’ve been placed directly on the head of the Perfect 10 web developer, the company tried, and tried, and tried again to make Google (and others) pay for its design inadequacies.

    Each time, these attempts did little but clog up a court system that’s already bursting at the seams.

    There was, apparently, a slight moment of victory when another judge upheld a Perfect 10 filing against Megaupload, a file-sharing site that allows others to swap files via email or direct download. Granted, Megaupload doesn’t have the money Google does, but even the smaller victories count, right?

    It should also be noted that when a “Perfect 10” search is conducted in Google Images, the amount of content originating from the site in question is negligible, even if SafeSearch is turned off. This means that, even though the Perfect 10 web developers finally figured out how to protect their paid content, the company still wants to nail Google to the cross.

    A semi-recent post on the Perfect 10 blog reveals as much. The title, “Google Is Destroying The Entertainment Industry” reeks of a “give me back my money” approach, courtesy of Mel Gibson and South Park:


    If at first you don’t succeed in making others pay your way, try, try again.

  • Webmasters: Googlebot Caught in Spider Trap, Ignoring Robots.txt

    Sometimes webmasters set up a spider trap or crawler trap to catch spambots or other crawlers that waste their bandwidth. If some webmasters are right, Googlebot (Google’s crawler) seems to be having some issues here.

    In the WebmasterWorld forum, member Starchild started a thread by saying, “I saw today that Googlebot got caught in a spider trap that it shouldn’t have as that dir is blocked via robots.txt. I know of at least one other person recently who this has also happened to. Why is GB ignoring robots?”

    Another member suggested that Starchild was mistaken, as such claims have been made in the past, only to find that there were other issues at play.

    Starchild responded, however, that it had been in place for “many months” with no changes. “Then I got a notification it was blocked (via the spidertrap notifier). Sure enough, it was. Upon double checking, Google webmaster tools reported a 403 forbidden error. IP was google. I whitelisted it, and Google webmaster tools then gave a success.”

    Another member, nippi, said they also got hit by it four months after setting up a spider trap, which was “working fine” until now.

    “The link to the spider trap is rel=Nofollowed, the folder is banned in robot.txt. The spider trap works by banning by ip address, not user agent so its not caused by a faker – and of course robots.txt was setup up correctly and prior, it was in place days before the spider trap was turned on, and it’s run with no problems for months,” nippi added. “My logs show, it was the real google, from a real google ip address that ignored my robots.txt, ignored rel-nofollow and basically killed my site.”
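
    For readers unfamiliar with the mechanism, a spider trap boils down to a path that is disallowed in robots.txt, so only clients ignoring the file should ever request it, plus an IP ban for any client that does. The sketch below is a bare-bones illustration with made-up paths and port; real traps usually ban at the web-server or firewall level rather than inside an application like this.

    ```python
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PATH = "/trap/"   # disallowed in robots.txt; no polite crawler should request it
    banned_ips = set()

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]

            if self.path == "/robots.txt":
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"User-agent: *\nDisallow: /trap/\n")
                return

            if self.path.startswith(TRAP_PATH):
                banned_ips.add(ip)  # this client ignored robots.txt: ban its IP

            if ip in banned_ips:
                self.send_response(403)  # banned clients see the kind of 403 described above
                self.end_headers()
                return

            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>ok</body></html>")

    if __name__ == "__main__":
        HTTPServer(("", 8000), TrapHandler).serve_forever()
    ```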

    We’ve reached out to Google for comment, and we’ll update this article if and when we receive a response.

    Meanwhile, Barry Schwartz is reporting that one site lost 60% of its traffic instantly, due to a bug in Google’s algorithm. He points to a Google Webmaster Help forum thread where Google’s Pierre Far said:

    I reached out to a team internally and they identified an algorithm that is inadvertently negatively impacting your site and causing the traffic drop. They’re working on a fix which hopefully will be deployed soon.

    Google’s Kaspar Szymanski commented on Schwartz’s post: “While we can not guarantee crawling, indexing or ranking of sites, I believe this case shows once again that our Google Help Forum is a great communication channel for webmasters.”

  • Developer Shares Story of Being Threatened by Facebook for Crawling

    Pete Warden, a former software engineer at Apple who is now working on his own start-up, posted an interesting story about how Facebook threatened to sue him for crawling the social network. I reached out to both Warden and Facebook for more details, but so far have only received a response from Facebook, which calls the incident a "violation of our terms."

    But first, Warden’s story. Read the whole thing in his words here for more context about what he wanted to do with the data, but to make a long story short: he was building a tool to bring data from email and various social networks into one place to make it easier for users to manage their contacts, and he crawled Facebook. He says he checked Facebook’s robots.txt, found that "they welcome the web crawlers that search engines use to gather their data," and so wrote his own crawler. He was able to obtain data like which pages people were fans of and links to a few of their friends. He created a map showing how different countries, states and cities were connected to each other and released it so that others could use the information. Once Facebook caught wind of this, they threatened legal action. Warden writes:

    Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission.

    Obviously this isn’t the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney. They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed.
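
    For context, the robots.txt-obeying behavior Warden describes amounts to fetching a site’s robots.txt and checking every URL against it before requesting that URL. Here’s a minimal sketch; the site, user agent, and URL are placeholders, not Warden’s actual crawler.

    ```python
    from urllib.request import Request, urlopen
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "example-research-crawler"
    SITE = "http://example.com"

    # Fetch and parse the site's robots.txt once up front.
    robots = RobotFileParser()
    robots.set_url(SITE + "/robots.txt")
    robots.read()

    def fetch_if_allowed(url):
        """Request the URL only if robots.txt permits it for our user agent."""
        if not robots.can_fetch(USER_AGENT, url):
            print("skipping (disallowed by robots.txt):", url)
            return None
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            return resp.read()

    page = fetch_if_allowed(SITE + "/some/public/page")
    ```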

    Facebook Public Policy Communications Manager Andrew Noyes tells WebProNews, "Pete Warden aggregated a large amount of data from over 200 million users without our permission, in violation of our terms. He also publicly stated he intended to make that raw data freely available to others. Warden was extremely cooperative with Facebook from the moment we contacted him and he abandoned his plans."

    "We have, and will continue to, act to enforce our terms of service where appropriate," adds Noyes.

    Noyes pointed to Facebook’s Statement of Rights and Responsibilities, which states that "You will not collect users’ content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission." That’s under the safety section, by the way.

    "I’m bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?), concludes Warden. "And a bit frustrated that people don’t understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I’m just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup."

    Now that you’ve heard some of what both parties have to say on the issue, what are your thoughts? Discuss here.

    If we hear back from Warden or if Facebook offers us more insight into the situation, which I’m told may still happen, I’ll update this article.