WebProNews

Tag: Search

  • Google Penguin Update: Google Granted Another Possibly Related Patent

    Google released the Penguin update a couple of weeks ago, in an effort to rid its search engine results of webspam. It targets the kinds of things Google has always tried to keep out of its results, but is supposed to make Google algorithmically better at catching them. That, combined with the ever-refreshing Panda update, could go a long way toward keeping Google’s results closer to spam-free than in previous years.

    Meanwhile, Google continues to secure related patents. Bill Slawski is always on top of the patents in the search industry, and recently pointed out some that may have a direct role in how Google handles webspam. Today, Google was granted another, as Slawski points out. As usual, he does a wonderful job of making sense of the patent.

    While it appears pretty complex, and there is more to it, part of it is about how Google can disassociate spam from legitimate content, which, at its most basic level, is the point of the Penguin update.

    It’s called Content Entity Management. Here’s the abstract:

    A first content entity and one or more associated second content entities are presented to one or more arbiters. Arbiter determinations relating to the association of at least one of the second content entities with the first content entity are received. A determination as to whether the at least one of the second content entities is to be disassociated from the first content entity based on the arbiter determinations can be made.

    “It makes sense for Google to have some kind of interface that could be used to both algorithmically identify webspam and allow human beings to take actions such as disassociating some kinds of content with others,” explains Slawski. “This patent presents a framework for such a system, but I expect that whatever system Google is using at this point is probably more sophisticated than what the patent describes.”

    The patent was filed as far back as March 2007.

    To the point about human beings, which, as Slawski acknowledges, could be Google’s human raters (and/or others on Google’s team), there is a part in the patent that says:

    In one example implementation, arbiters can also provide a rationale for disassociation. The rationale can, for example, be predefined, e.g., check boxes for categories such as “Obscene,” “Unrelated,” “Spam,” “Unintelligible,” etc. Alternatively, the rationale can be subjective, e.g., a text field can be provided in which an arbiter can provide reasons for an arbiter determination. The rationale can, for example, be reviewed by administrators for acceptance of a determination, or to tune arbiter agents, etc. In another implementation, the rationale provided by the two or more arbiters must also match, or be substantially similar, before the second content entity 110 is disassociated from the first content entity 108. Emphasis added.
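    Just to make the arbiter mechanism a bit more concrete, here is a minimal Python sketch of the kind of consensus check the patent describes: a second content entity is only disassociated when enough arbiters vote for it, and (in the stricter variant quoted above) when the agreeing arbiters give matching rationales. The names, thresholds and data here are hypothetical; the patent doesn’t specify an implementation.

    ```python
    from collections import Counter

    # Hypothetical arbiter verdicts for one (first entity, second entity) pair.
    # Each verdict: (should_disassociate, rationale), where rationale is one of
    # the predefined categories the patent mentions, e.g. "Spam" or "Unrelated".
    verdicts = [
        (True, "Spam"),
        (True, "Spam"),
        (False, "Unrelated"),
    ]

    def should_disassociate(verdicts, min_votes=2, require_matching_rationale=True):
        """Return True if the second content entity should be disassociated."""
        yes_rationales = [rationale for vote, rationale in verdicts if vote]
        if len(yes_rationales) < min_votes:
            return False
        if require_matching_rationale:
            # Stricter variant: the agreeing arbiters' rationales must match
            # (exact matching is used here for simplicity).
            top_rationale, count = Counter(yes_rationales).most_common(1)[0]
            return count >= min_votes
        return True

    print(should_disassociate(verdicts))  # True: two arbiters flagged it as "Spam"
    ```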

    The actual background described in the filing talks a little about spam:

    A first content entity, e.g., a video and/or audio file, a web page for a particular subject or subject environment, a search query, a news article, etc., can have one or more associated second content entities, e.g., user ratings, reviews, tags, links to other web pages, a collection of search results based on a search query, links to file downloads, etc. The second content entities can, for example, be associated with the first content entity by a user input or by a relevance determination. For example, a user may associate a review with a video file on a web site, or a search engine may identify search results based on a search query.

    Frequently, however, the second content entities associated with the first content entity may not be relevant to the first content entity, and/or may be inappropriate, and/or may otherwise not be properly associated with the first content entity. For example, instead of providing a review of a product or video, users may include links to spam sites in the review text, or may include profanity, and/or other irrelevant or inappropriate content. Likewise, users can, for example, manipulate results of search engines or serving engines by artificially weighting a second content entity to influence the ranking of the second content entity. For example, the rank of a web page may be manipulated by creating multiple pages that link to the page using a common anchor text.

    Another part of the lengthy patent document mentions spam in relation to scoring:

    In another implementation, the content management engine 202 can, for example, search for one or more specific formats in the second content entities 110. For example, the specific formats may indicate a higher probability of a spam annotation. For example, the content management engine 202 can search for predetermined uniform resource locators (URLs) in the second content entities 110. If the content management engine 202 identifies a predetermined URL in a second content entity 110, the content management engine 202 can assign a low association score to that second content entity 110.
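    As a rough illustration of that passage (and nothing more), here is a short sketch of URL-based scoring. The URL list, score values and function name are made up for the example; they are not from the patent or from Google.

    ```python
    # Hypothetical list of URLs known to be associated with spam annotations.
    PREDETERMINED_SPAM_URLS = {
        "http://example-spam-site.test",
        "http://buy-cheap-stuff.test",
    }

    LOW_ASSOCIATION_SCORE = 0.05
    DEFAULT_ASSOCIATION_SCORE = 0.5

    def association_score(second_entity_text: str) -> float:
        """Assign a low association score if the entity contains a predetermined URL."""
        for url in PREDETERMINED_SPAM_URLS:
            if url in second_entity_text:
                return LOW_ASSOCIATION_SCORE
        return DEFAULT_ASSOCIATION_SCORE

    print(association_score("Great video! Check out http://buy-cheap-stuff.test"))  # 0.05
    ```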

    Another part, discussing comments, also talks about spam detection:

    In another implementation, a series of questions can be presented to an arbiter, e.g., “Is the comment interesting?,” “Is the comment offensive?,” “Does this comment appear to be a spam link?” etc. Based on the arbiter answers, the content management engine 102 or the content management engine 202 can, for example, determine whether one or more second content entities are to be disassociated with a first content entity item.

    The document is over 10,000 words of patent speak, so if you’re feeling up to that, by all means, give it a look. It’s always interesting to see the systems Google has patented, though it’s important to keep in mind that these aren’t necessarily being used in the way they’re described. Given the amount of time it takes for a company to be granted a patent, there’s always a high probability that the company has moved on to a completely different process, or at least a much-evolved version. And of course, various systems can work in conjunction with one another. It’s not as if any one patent is going to provide a full picture of what’s really going on behind the scenes.

    Still, there can be clues within such documents that can help us to understand some of the things Google is looking at, and possibly implementing.

    Image: Batman: The Animated Series

  • Post-Google Penguin Update Content Tips Endorsed By Matt Cutts

    Marc Ensign published a good blog post about staying on good terms with Google in the post-Penguin world. There are plenty of posts out there on this topic. I’ve seen a fair number of pretty good ones, but this one might be worth paying particular attention to.

    The post, titled “Google Shakeup: Coming To A Website Near You,” has a bulleted list of steps for a sound content strategy. There are certainly plenty of good posts on this subject out there too, but Google’s head of webspam, Matt Cutts, gave something of an endorsement to this list on Twitter in a conversation with Ensign.

    @mattcutts You have a sense of humor, right? Picturing you with black hair and a nose ring seemed like a good idea http://t.co/yBlnCdFQ

    @MarcEnsign over the years I’ve grown a pretty thick skin. 🙂

    @mattcutts C’mon, you know we all love you! We really don’t have a choice! 🙂 Would love to hear your thoughts on my post if you have time.

    @MarcEnsign the bullet points looked solid. I haven’t seen Happy Feet 2, so I can’t vouch for that part. 😉

    So the bullet points from Ensign’s post, which Cutts says “looked solid,” include:

    • Create a blog and consistently build up your site into a wealth of valuable content.
    • Work with a PR firm or read a book and start writing legitimate press releases on a regular basis and post them on your site.
    • Visit blogs within your industry and leave valuable feedback in their comments section.
    • Link out to other valuable resources within your industry that would benefit your visitors.
    • Share everything you are creating on 2 or 3 of your favorite social media sites of choice.
    • Position yourself as an expert.

    I should make a point about that second-to-last one: sharing EVERYTHING you are creating on 2 or 3 social networks. In another article, we looked at a Webmaster Help video Cutts posted in response to a user-submitted question about using your Twitter account like an RSS service for every article you post.

    While Cutts indicated that doing that isn’t going to be a problem as far as Google’s quality guidelines are concerned, he said it can be annoying if you do it with every post, and you post a whole lot of content. I made the case for why it depends on how the user is using Twitter.

    Just seemed worth pointing out.

    Note: I know I’ve written a whole lot about Matt Cutts lately. I’m not stalking him. I promise. It’s just that webmasters want to rank in Google, and he’s obviously the go-to guy for advice, so it seems appropriate that people know about what he’s saying on these topics. Hence, our extensive Matt Cutts coverage. By the way, perusing that coverage is advised. On our Matt Cutts page, you’ll find a plethora of great advice right from Cutts.

  • Can Machines Produce Authoritative Search Results?

    If the future of journalism is machine-produced content, it may face one major obstacle – search engine visibility. We’ve written about Narrative Science, a company whose business just happens to be providing such content, several times. In fact, earlier today, we posted a piece about the company’s attempt at bursting the “filter bubble.”

    We’ve also been paying a lot of attention to Google’s quality guidelines in light of the Penguin update, which targets sites in violation of them. One of those guidelines is:

    Avoid “doorway” pages created just for search engines, or other “cookie cutter” approaches such as affiliate programs with little or no original content.

    If you click that “little or no original content” link, it takes you to a page with some examples. Among them: thin affiliate sites, doorway pages, scraped content and auto-generated content.

    Wait a minute. Auto-generated content?

    Auto-generated content: Content generated programmatically. Often this will consist of random paragraphs of text that make no sense to the reader but that may contain search keywords.

    Well, I don’t think Google had something like Narrative Science in mind when they came up with that, but it poses an interesting question: just how does Google feel about this kind of content? On the one hand, it is “content generated programmatically”. On the other hand, it’s not going to “consist of random paragraphs of text that make no sense to the reader.”

    Supposedly, the technology is getting better at writing more human-like articles. By some accounts, it may even be less prone to mistakes than human-produced content.

    Google can certainly identify with that logic (think driverless cars, which are supposed to be safer than human-driven ones).

    It seems fairly likely that we’re going to see more Narrative Science-like companies emerge. For example, Automated Insights seems to be traveling down a similar path. I would not be surprised to see a new wave of content farms (of the robotic kind) in the near term. Google might have some new fun to deal with in that regard.

    The real question is: if machines are producing content of higher quality than humans (or even just as high), should this content rank better in search results? What do you think?

    Are companies like Narrative Science going to be able to produce content that meets Panda guidelines? Google wants to provide more authoritative results. Can machines produce authoritative content?

    These are the kinds of questions we’re likely to be faced with, covering the search industry.

    We’ve reached out to both Google and Narrative Science to dig into this further, and will update as info becomes available.

  • Google Search Results Pages May Soon Be Even More Cluttered

    Google has been testing/experimenting with some richer search results pages for things like movies, actors, bands, books, people, etc. We wrote about it last month when a reddit user posted a screen cap, but from the sound of it, these results may become commonplace for all Google users in the near term.

    Danny Sullivan at Search Engine Land reports that it “seems likely everyone may see this extended information soon.” He also says it’s likely that this is the “refresh” the Wall Street Journal reported on in March. We wrote about that here.

    Basically, based on the description from the Wall Street Journal, Google would be providing more direct answer-type results that would keep more users from having to click through links to actually find what they’re looking for. This would fit that bill.

    Based on the WSJ article, the content would be coming from the fruits of Google’s acquisition of Metaweb Technologies in 2010.

    “With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better,” Google said upon the acquisition. “Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers.”

    If these features do indeed move beyond experiments and become actual features, it will further illustrate that Google’s results pages are becoming much more cluttered, particularly when compared to Bing’s simpler refresh. It’s a far cry from the simple SERPs Google users got to know and love a decade or so ago, though there has been a lot of functionality added.

    Of course, not everyone’s thrilled with some of the things Bing is doing either.

    According to Sullivan, this particular experimental feature of Google used to come with a “Sources” label.

    Image Credit: reddit user philosyche

  • Howard Carter, King Tut Tomb Discoverer, Honored With Google Doodle

    In 1922, Howard Carter found something amazing in the Valley of the Kings – the Holy Grail for Egyptologists at the time. On November 4th, Carter and his group stumbled upon steps that led to the entrance of King Tut’s tomb. By the end of that month, Carter was busy chiseling away at the sealed entrance.

    Of course, “stumbled upon” might not give the explorers enough credit. The discovery of King Tut’s tomb came near the end of decades of work in the Valley of the Kings.

    For months, Carter and his team cataloged the many treasures that they found inside the tomb. Carter reportedly was unsure about the actual nature of what they had found. Was it the tomb of a King, or simply an underground treasure trove? It wasn’t until February of 1923 that Carter happened upon Tutankhamun’s burial room and got his answer.

    The excavation, processing, and cataloging of everything inside the tomb took Carter and his team almost ten full years. It was an extremely rigorous process, as each item had to be given a reference number (with multiple subdivisions) and photographed from multiple angles. Each object then received its own description and sketch on its reference card. Finally, items were taken to a lab and photographed even more.

    Today, the archaeologist who led this painstaking work is being celebrated with a Google Doodle. In the Doodle, the Google logo is obscured by many of the treasures that were uncovered in Tut’s tomb. Carter himself is featured in the center, gazing up at the prized sarcophagus, which is casually leaning against a column.

    Carter was born in 1874 and died in 1939. He would have been 138 years old today.

  • Google’s Matt Cutts Talks Search Result Popularity Vs. Accuracy

    Google’s head of webspam, Matt Cutts, posted a new Webmaster Help video today, discussing accuracy vs. popularity in search results. This video was his response to a user-submitted question:

    Does Google feel a responsibility to the public to return results that are truly based on a page’s quality (assuming quality is determined by the accuracy of info on a page) as opposed to popularity?

    “Popularity is different than accuracy,” says Cutts. “And in fact, PageRank is different than popularity. I did a video that talked about porn a while ago that basically said a lot of people visit porn sites, but very few people link to porn sites. So the Iowa Real Estate Board is more likely to have higher PageRank than a lot of porn sites, just because people link to the official governmental sites, even if they sometimes visit the porn sites a little bit more often.”

    Here’s that video, by the way:

    “So I do think that reputation is different than popularity, and PageRank encodes that reputation pretty well,” Cutts continues. “At the same time, I go to bed at night sleeping relatively well, knowing that I’m trying to change the world. And I think a lot of people at Google feel that way. They’re like trying to find the best way to return the best content. So we feel good about that. And at the same time, we do feel the weight, the responsibility of what we’re doing, because are we coming up with the best signals? Are we finding the best ways to slice and dice data and measure the quality of pages or the quality of sites? And so people brainstorm a lot. And I think that they do feel the weight, the responsibility of being a leading search engine and trying to find the very best quality content.”

    “Even somebody who has done a medical search, the difference between stage four brain cancer versus the query grade four brain cancer, it turns out that very specific medical terminology can determine which kinds of results you get. And if you just happen not to know the right word, then you might not get the best results. And so we try to think about how can we help the user out if they don’t necessarily know the specific vocabulary?”

    Interesting example. We’ve pointed to the example of “level 4 brain cancer” a handful of times in our Panda and pre-Panda coverage of content farms’ effects on search results. The top result for that query, by the way, is better than it once was, though the eHow result (written by a freelance writer claiming specialities in military employment, mental health and gardens – who has also written a fair amount about toilets), which was ranking before, is still number two.

    level 4 brain cancer results

    It’s worth noting that Google’s most recent list of algorithm updates includes some tweaks to surface more authoritative results.

    “So I would say that at least in search quality in the knowledge group, we do feel a lot of responsibility,” says Cutts. “We do feel like we know a lot of people around the world are counting on Google to return good quality search results. And we do the best we can, or at least we try really hard to think of the best ways we can think of to return high-quality search results.”

    “That’s part of what makes it a fun job,” he says. “But it definitely is one where you understand that you are impacting people’s lives. And so you do try to make sure that you act appropriately. And you do try to make sure that you can find the best content and the best quality stuff that you can. But it’s a really fun job, and it’s a really rewarding job for just that same reason.”

    Cutts then gets into some points that the antitrust lawyers will surely enjoy.

    “What makes me feel better is that there are a lot of different search engines that have different philosophies,” he says. “And so if Google isn’t doing a good job, I do think that Bing, or Blekko, or DuckDuckGo, or other search engines in the space will explore and find other ways to return things. And not just other general search engines, but people who want to do travel might go specifically to other websites. So I think that there’s a lot of opportunities on the web.”

    “I think Google has done well because we return relatively good search results. But we understand that if we don’t do a good job at that, our users will complain,” he says. “They’ll go other places. And so we don’t just try to return good search results because it’s good for business. It’s also because we’re Google searchers as well. And we want to return the best search results so that they work for everybody and for us included.”

    Well, users do complain all the time, and certainly some of them talk about using other services, but the monthly search market reports don’t appear to suggest that Google has run too many people off, so they must be doing something right.

  • Google Makes Some Local Search Adjustments

    On Friday, Google put out its monthly list of algorithm changes for the month of April. We’ve taken a closer look at various entries on that list – there were over 50. Here’s our coverage so far:

    Google Algorithm Changes For April: Big List Released
    Google Increases Base Index Size By 15 Percent
    Google Makes More Freshness Tweaks To Algorithm
    Bi02sw41: Did Google Just Make Keywords Matter Less?
    Google Should Now Be Much Better At Handling Misspellings
    Google Tweaks Algorithm To Surface More Authoritative Results
    Google Launches Several Improvements To Sitelinks

    The list, along with the Penguin update and two Panda refreshes in April, is a lot for webmasters to take in. If local search is an area of focus for you, you should find the following entries on the list among the most interesting:

    • More local sites from organizations. [project codename “ImpOrgMap2”] This change makes it more likely you’ll find an organization website from your country (e.g. mexico.cnn.com for Mexico rather than cnn.com).
    • Improvements to local navigational searches. [launch codename “onebar-l”] For searches that include location terms, e.g. [dunston mint seattle] or [Vaso Azzurro Restaurant 94043], we are more likely to rank the local navigational homepages in the top position, even in cases where the navigational page does not mention the location.
    • More comprehensive predictions for local queries. [project codename “Autocomplete”] This change improves the comprehensiveness of autocomplete predictions by expanding coverage for long-tail U.S. local search queries such as addresses or small businesses.
    • Improvements to triggering of public data search feature. [launch codename “Plunge_Local”, project codename “DIVE”] This launch improves triggering for the public data search feature, broadening the range of queries that will return helpful population and unemployment data.

    The first item on the above list is interesting. Subdomains for various locales may be a better idea than ever now. However, the implementation and delivery of content will no doubt be incredibly important. Here’s a bit about duplicate content and internationalizing.

    We actually referenced the second one on the list in a different article about how Google treats keywords. It appears that key phrases may carry less weight, at least for some searches. The local examples Google gives here indicate that this is particularly the case when you’re talking local.

    With regards to the third item, it will be interesting to see just how local predictions behave. It’s certainly something local businesses will want to pay attention to and analyze as it pertains to them.

    I’m not sure the fourth one will have many implications for most businesses, but it’s interesting from the user perspective, as Google looks to provide more data directly in search results.

    For some more insight into local search, check out this study from a couple months back, which attempted to identify local ranking factors.

  • Google Launches Several Improvements To Sitelinks

    We’re still digging into Google’s big list of algorithm changes released on Friday. You can read about some of the noteworthy changes in the following articles:

    Google Algorithm Changes For April: Big List Released
    Google Increases Base Index Size By 15 Percent
    Google Makes More Freshness Tweaks To Algorithm
    Bi02sw41: Did Google Just Make Keywords Matter Less?
    Google Should Now Be Much Better At Handling Misspellings
    Google Tweaks Algorithm To Surface More Authoritative Results

    There were over 50 changes announced for April, and 4 of them had to do specifically with sitelinks:

    • “Sub-sitelinks” in expanded sitelinks. [launch codename “thanksgiving”] This improvement digs deeper into megasitelinks by showing sub-sitelinks instead of the normal snippet.
    • Better ranking of expanded sitelinks. [project codename “Megasitelinks”] This change improves the ranking of megasitelinks by providing a minimum score for the sitelink based on a score for the same URL used in general ranking.
    • Sitelinks data refresh. [launch codename “Saralee-76”] Sitelinks (the links that appear beneath some search results and link deeper into the site) are generated in part by an offline process that analyzes site structure and other data to determine the most relevant links to show users. We’ve recently updated the data through our offline process. These updates happen frequently (on the order of weeks).
    • Less snippet duplication in expanded sitelinks. [project codename “Megasitelinks”] We’ve adopted a new technique to reduce duplication in the snippets of expanded sitelinks.

    That “dig deeper” link, by the way, links to Inception on Know Your Meme. You might find the other link from the list a bit more useful though. It goes to a blog post from Google’s Inside Search blog from last summer, talking about the evolution of sitelinks, when they launched full-size links (with a URL and one line of snippet text) and an increase to the maximum number of sitelinks per query (from 8 to 12).

    Mega Sitelinks

    At that time, they also combined sitelink ranking with regular result ranking to “yield a higher-quality list of links” for sitelinks. Presumably, it is that aspect that Google considers to be “megasitelinks,” as that is the project code name of the change in the new list that talks about better ranking of expanded sitelinks. The change, as noted, provides a minimum score for the sitelink based on a score for the same URL used in general ranking.

    One of the changes was a data refresh, so the sitelinks gathered should be based on fresher information.

  • Google Tweaks Algorithm To Surface More Authoritative Results

    Who you are matters more in search than ever. This is reflected in search engines’ increased focus on social signals, and especially with authorship markup, which connects the content you produce with your Google profile, and ultimately your Google presence.

    Late on Friday, Google released its monthly list of search algorithm changes, and among them was:

    More authoritative results. We’ve tweaked a signal we use to surface more authoritative content.

    Google has tried to deliver the most authoritative content in search results for as long as I can remember, but clearly it’s been pretty hard to get right all the time. The Panda update, introduced in February 2011, was a huge step in the right direction – that is if you think Panda has done its job well. Perhaps to a lesser extent, the Penguin update is another step, as its aim is to eliminate the spam cluttering up the search results, taking away from the actual authority sites.

    About a year ago, Google released a list of questions that “one could use to assess the quality of a page or an article.” This was as close as we got to a guide on how to approach search in light of the Panda update. There were 23 questions in all. Some of them are directly related to authority.

    Would you trust the information presented in this article?

    Is this article written by an expert or enthusiast who knows the topic well, or is it more shallow in nature?

    Does this article have spelling, stylistic, or factual errors?

    Are the topics driven by genuine interests of readers of the site, or does the site generate content by attempting to guess what might rank well in search engines?

    Does the article provide original content or information, original reporting, original research, or original analysis?

    Does the page provide substantial value when compared to other pages in search results?

    How much quality control is done on content?

    Does the article describe both sides of a story?

    Is the site a recognized authority on its topic?

    For a health related query, would you trust information from this site?

    Would you recognize this site as an authoritative source when mentioned by name?

    Does this article provide a complete or comprehensive description of the topic?

    Does this article contain insightful analysis or interesting information that is beyond obvious?

    Would you expect to see this article in a printed magazine, encyclopedia or book?

    Are the articles short, unsubstantial, or otherwise lacking in helpful specifics?

    Google’s Matt Cutts gave something of an endorsement to a list of tips to consider post-Penguin update, written by Marc Ensign. One of those was “Position yourself as an expert.”

    Of course, we don’t know what exactly Google did to the signal (one of many, I presume) it uses to surface more authoritative content. It’s worth noting that they made a change to it, however, and it will be interesting to see if there’s a noticeable impact in search results.

    It’s one thing for Google to preach about quality content and say that’s what it wants to deliver to users, but we continue to see Google cite specific actions it has taken to make good on that, even if we can’t know exactly what they are (Google is vague when it lists its changes). Panda and Penguin are obviously major steps, but Google seems to be doing a variety of other things that cater to that too.

    I mentioned authorship. That’s a big one, and one you should be taking advantage of if you want to be seen as an authority in Google’s eyes. It really means you should be engaging on Google+ too, because it’s tied directly to it. For some authors, Google will even show how many people have you in Circles in the search results. It’s hard to dispute you being an authority if you manage to rack up a substantial follower count.

  • How Google Handles Font Replacement


    Google’s Matt Cutts put up a new Webmaster Help video, discussing how Google handles font replacement. The video was created in response to a user-submitted question:

    How does Google view font replacement (ie. Cufan, SIFR, FLIR)? Are some methods better than others, are all good, all bad?

    “So we have mentioned some specific stuff like SIFR that we’re OK with. But again, think about this,” says Cutts. “You want to basically show the same content to users that you do to Googlebot. And so, as much as possible, you want to show the same actual content. So we’ve said that having fonts using methods like SIFR is OK, but ideally, you might concentrate on some of the newer stuff that has been happening in that space.”

    “So if you search for web fonts, I think Google, for example, has a web font directory of over 100 different web fonts,” Cutts says. “So now we’re starting to get the point where, if you use one of these types of commonly available fonts, you don’t even have to do font replacement using the traditional techniques. It’s actual letters that are selectable and copy and pastable in your browser. So it’s not the case that we tend to see a lot of deception and a lot of abuse.”

    “If you were to have a logo here and then underneath the logo have text that’s hidden that says buy cheap Viagra, debt consolidation, mortgages online, that sort of stuff, then that could be viewed as deceptive,” he adds.

    In fact, that’s exactly the kind of thing that can get you in trouble with Google’s Penguin update, even if Google doesn’t get you with a manual penalty. To avoid this, here’s more advice from Google, regarding hidden text.

    “But if the text that’s in the font replacement technique is the same as what is in the logo, then you should be in pretty good shape,” Cutts wraps up the video. “However, I would encourage people to check out some of this newer stuff, because the newer stuff doesn’t actually have to do some of these techniques. Rather, it’s the actual letters, and it’s just using different ways of marking that up, so that the browser, it looks really good. And yet, at the same time, the real text is there. And so search engines are able to index it and process it, just like they would normal text.”

  • Google Should Now Be Much Better At Handling Misspellings

    Late on Friday, Google unveiled its monthly list of algorithm changes, for the month of April. As usual, there is plenty to take in, with over 50 changes. Here are some observations we’ve made so far:

    Google Algorithm Changes For April: Big List Released
    Google Makes More Freshness Tweaks To Algorithm
    Google Increases Base Index Size By 15 Percent
    Bi02sw41: Did Google Just Make Keywords Matter Less?

    Improvements in how Google handles spelling issues seem to be a major theme of April’s list. Here are the relevant list entries related to spelling:

    • Fewer bad spell corrections internationally. [launch codename “Potage”, project codename “Spelling”] When you search for [mango tea], we don’t want to show spelling predictions like “Did you mean ‘mint tea’?” We have algorithms designed to prevent these “bad spell corrections” and this change internationalizes one of those algorithms.
    • More spelling corrections globally and in more languages. [launch codename “pita”, project codename “Autocomplete”] Sometimes autocomplete will correct your spelling before you’ve finished typing. We’ve been offering advanced spelling corrections in English, and recently we extended the comprehensiveness of this feature to cover more than 60 languages.
    • More spell corrections for long queries. [launch codename “caterpillar_new”, project codename “Spelling”] We rolled out a change making it more likely that your query will get a spell correction even if it’s longer than ten terms. You can watch uncut footage of when we decided to launch this from our past blog post.
    • More comprehensive triggering of “showing results for” goes international. [launch codename “ifprdym”, project codename “Spelling”] In some cases when you’ve misspelled a search, say [pnumatic], the results you find will actually be results for the corrected query, “pneumatic.” In the past, we haven’t always provided the explicit user interface to say, “Showing results for pneumatic” and the option to “Search instead for pnumatic.” We recently started showing the explicit “Showing results for” interface more often in these cases in English, and now we’re expanding that to new languages.
    • “Did you mean” suppression goes international. [launch codename “idymsup”, project codename “Spelling”] Sometimes the “Did you mean?” spelling feature predicts spelling corrections that are accurate, but wouldn’t actually be helpful if clicked. For example, the results for the predicted correction of your search may be nearly identical to the results for your original search. In these cases, inviting you to refine your search isn’t helpful. This change first checks a spell prediction to see if it’s useful before presenting it to the user. This algorithm was already rolled out in English, but now we’ve expanded to new languages.
    • Spelling model refresh and quality improvements. We’ve refreshed spelling models and launched quality improvements in 27 languages.

    So, it sounds like Google will be attempting to correct more misspellings, while getting better at its corrections in general.

    The one called “More spell corrections for long queries” was actually the one discussed in a video Google recently shared, showing an “uncut” look inside a search quality meeting. Remember that video, which showed a bunch of Googlers using Macs, discussing the algorithm?

    “As you may recall, a couple months back we shared uncut video discussion of a spelling related change, and now that’s launched as well (see “More spell corrections for long queries”),” says Matt Cutts in the announcement of April’s changes.

    It’s interesting that the change was in discussion all the way back in December, and didn’t go live until sometime in April.

    Also in April, Google rolled out some misspelling improvements on the paid side of search, introducing new near match types.

    “People aren’t perfect spellers or typists. At least 7% of search queries contain a misspelling, and the longer the query, the higher the rate,” Google AdWords Product Manager Jen Huang recently said.

  • Microsoft, Yahoo Search Alliance Transition Complete in UK, Ireland, France

    In mid-April, Microsoft announced that the Microsoft-Yahoo Search Alliance was nearly final in the UK, Ireland and France, after announcing the expansion into these countries in February. The ad transition began on April 18.

    Today, Microsoft announced that the transition is now complete in these countries.

    “Yahoo! Search is really an important part of our business and we have invested a lot of time to make the transition as efficient and as seamless as possible for our advertisers and publishers,” said Jon Myers, Director, Account Management UK and Ireland. “Now we can focus on delivering compelling content for our users and customers to build relevant online experiences.”

    “We’re delighted to successfully reach this important milestone in the UK, France and Ireland,” said Microsoft’s Mark Richardson. “As a result of this transition we believe Bing users will see more useful advertising while presenting advertisers with an increasingly compelling alternative in search advertising. We look forward to this rolling out across the next set of European markets.”

    Microsoft has indicated that Germany, Austria and Switzerland would be the focal points for the Search Alliance, following the UK, Ireland and France. The transition on the paid search side has already been completed in North America and India. For organic search, it’s already been completed globally.

    Earlier this month, Microsoft announced that it has rebranded its ads for SMBs as “Bing, powered by Bing and Yahoo! Search.”

  • Google Penguin Update: Report Spam With Google Docs

    You can learn a lot of helpful little tidbits by listening to what Google’s head of webspam, Matt Cutts, has to say, and lucky for webmasters, he’s always saying something through various channels on the web. These include YouTube videos, his blog, Google+, Twitter, Google’s official blogs and various forums and comments threads.

    In case you were wondering, according to Cutts (talking on Twitter), it’s fine if you want to send a link to a Google Docs spreadsheet when you report Penguin spam.

    @mattcutts Can we send a link to a Google Docs spreadsheet when reporting spam? #penguin

    Last week, Cutts tweeted that Google had read and processed almost all post-Penguin spam reports:

    @Penguin_Spam yup yup, we’ve read/processed almost all of them. A few recent ones left.

    I’m sure there have been some reports submitted since then, but clearly Google isn’t taking too long to sift through them.

  • Google Makes More Freshness Tweaks To Algorithm

    Google has clearly placed a lot of focus on freshness in recent months, and that continues with the company’s big list of algorithm changes for the month of April. It will be interesting to see if there is a noticeable improvement in results following these changes.

    Have you noticed freshness-related improvements yet? Let us know in the comments.

    Here are the changes Google listed today for the month of April, related to freshness:

    • Smoother ranking changes for fresh results. [launch codename “sep”, project codename “Freshness”] We want to help you find the freshest results, particularly for searches with important new web content, such as breaking news topics. We try to promote content that appears to be fresh. This change applies a more granular classifier, leading to more nuanced changes in ranking based on freshness.
    • Improvement in a freshness signal. [launch codename “citron”, project codename “Freshness”] This change is a minor improvement to one of the freshness signals which helps to better identify fresh documents.
    • No freshness boost for low-quality content. [launch codename “NoRot”, project codename “Freshness”] We have modified a classifier we use to promote fresh content to exclude fresh content identified as particularly low-quality.
    • UI improvements for breaking news topics. [launch codename “Smoothie”, project codename “Smoothie”] We’ve improved the user interface for news results when you’re searching for a breaking news topic. You’ll often see a large image thumbnail alongside two fresh news results.
    • No freshness boost for low quality sites. [launch codename “NoRot”, project codename “Freshness”] We’ve modified a classifier we use to promote fresh content to exclude sites identified as particularly low-quality.

    Notice that two of those (the ones about no freshness boost for low quality) are pretty much identical. One of them says “content” and the other says “sites,” but the descriptions are the same, so it’s not clear whether that’s a mistake or whether there’s a subtle difference.

    Either way, it’s a noteworthy change, and it will be interesting to see if there is a clear impact.

    As I’ve written about recently, I have found freshness to be outweighing relevancy in results sometimes, but I don’t necessarily think it’s been in relation to actual poor quality content – just when an older result makes more sense than a newer result, even if the newer one is high quality too.

    Image: Parents Just Don’t Understand (via Fade Theory)

  • Google Penguin Update: Google Has Read, Processed Almost All Spam Reports

    Google’s Matt Cutts recently tweeted that people should fill out a form to report post-Penguin spam. “We’re reading feedback,” he said.

    To report post-Penguin spam, fill out https://t.co/di4RpizN and add “penguin” in the details. We’re reading feedback.

    Cutts indicated in a Twitter conversation today that Google has read and processed almost all of the Penguin spam reports:

    @mattcutts Do you guys really actually read the #Google #Spam reports tagged with #Penguin you asked to be filed or were you just saying it?

    @Penguin_Spam yup yup, we’ve read/processed almost all of them. A few recent ones left.

    On the other side of the equation, if you were hit by the Penguin update, and think this may have been an error, Google has a form for that too.

    In other news, there’s a Twitter account for Penguin spam.

    More Penguin Update coverage here.

  • Matt Cutts: Excessive Blog Updates To Twitter Not Doorways, But Possibly Annoying

    Google’s head of webspam took on an interesting question from a user in a new Webmaster Help video:

    Some websites use their Twitter account as an RSS like service for every article they post. Is that ok or would it be considered a doorway?

    I know he shoots these videos in advance, but the timing of the video’s release is interesting, considering that it’s asking about doorways. Google’s Penguin Update was unleashed on the web last week, seeking out violators of Google’s quality guidelines, and dealing with them algorithmically. One of Google’s guidelines is:

    Avoid “doorway” pages created just for search engines, or other “cookie cutter” approaches such as affiliate programs with little or no original content.

    There is no shortage of questions from webmasters wondering what exactly Google is going after with the update, which will likely come with future iterations, not unlike the Panda update. For more on some things to avoid, browse our Penguin coverage.

    Using your Twitter feed like an RSS feed, however, should not put you in harm’s way.

    “Well, I wouldn’t consider it a doorway because a doorway is typically when you make a whole bunch of different pages, each page is targeting one specific phrase,” he says. “And then when you land there, usually it’s like, click here to enter. And then it takes you somewhere, and monetizes you, or something along those lines. So I wouldn’t consider it a doorway.”

    Cutts does suggest that such a practice can be annoying to users, however.

    “Could it be annoying?” he continues. “Yes, it could be annoying, especially if you’re writing articles like every three minutes or if those articles are auto-generated somehow. But for example, in FeedBurner, I use a particular service where, when I do a post on my blog, it will automatically tweet to my Twitter stream, and it will say New Blog Post, colon, and whatever the title of the blog post is. And that’s perfectly fine.”

    “That’s a good way to alert your users that something’s going on,” he adds. “So there’s nothing wrong with saying, when you do a blog post, automatically do a tweet. It might be really annoying if you have so many blog posts, that you get so many tweets, that people start to ignore you or unfollow you. But it wouldn’t be considered a doorway.”
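    For what it’s worth, that kind of blog-to-Twitter automation is easy to approximate yourself. Below is a minimal Python sketch of the behavior Cutts describes, using the feedparser library to read a blog’s feed; post_tweet is a hypothetical placeholder for whatever service or API would actually send the tweet.

    ```python
    import feedparser  # third-party library: pip install feedparser

    def post_tweet(text: str):
        """Hypothetical stand-in for whatever actually sends the tweet."""
        print("Tweeting:", text)

    def announce_new_posts(feed_url: str, already_tweeted: set):
        """Tweet 'New Blog Post: <title> <link>' for each entry not yet announced."""
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            if entry.link not in already_tweeted:
                post_tweet(f"New Blog Post: {entry.title} {entry.link}")
                already_tweeted.add(entry.link)

    # Example usage (assumes the blog exposes an RSS/Atom feed at this URL):
    # announce_new_posts("https://example.com/feed", already_tweeted=set())
    ```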

    OK, so you’re safe from having to worry about that being considered a doorway in Google’s eyes.

    I’m not sure I entirely agree with Cutts’ point about it being annoying, however. Yes, I suppose it can be annoying. That really depends on the user, and how they use Twitter. I’m guessing that it is, in fact, annoying to Cutts.

    Just as some sites treat their Twitter feed like an RSS feed, however, there are plenty of Twitter users who use it as such. A lot of people don’t use RSS, and would simply prefer to get their news via Twitter feed. Some users in this category (I consider myself among them) follow sites on Twitter because they want to follow the content they’re putting out. It’s really about user preference. Not everybody uses Twitter the same way, so you have to determine how you want to approach it.

    Cutts is definitely right in that some may unfollow you, but there could be just as many who will follow you because they want the latest.

    Either way, it doesn’t appear to be an issue as far as Google rankings are concerned.

  • Google Analytics Social Reports Get Backlink URLs, Post Titles

    In March, Google announced the release of new social reports in Google Analytics. These included an Overview Report, a Conversion Report, a Social Sources report, a Social Plugins report, and an activity stream tab. Today, the company announced some further expansion of social reports. Google’s now showing backlink URLs and post titles within the social reports.

    “The concept of trackbacks, a protocol by which different sites could notify each other of referencing links, first emerged back in 2002,” says Ilya Grigorik with Google’s Analytics team. “Since then, the blogosphere has grown in leaps and bounds, but the requirement for each site to explicitly implement this protocol has always stood in the way of adoption. If only you could crawl the web and build an accurate link graph. The good news is we already do that at Google, and are now providing this insight to Google Analytics users.”

    Social Reports with Trackbacks

    “These reports provide another layer of social insight showing which of your content attracts links, and enables you to keep track of conversations across other sites that link to your content,” says Grigorik. “Most website and blog owners had no easy mechanism to do this in the past, but we see it as another important feature for holistic social media reports. When you know what your most linked content is, it is then also much easier to replicate the success and ensure that you are building relationships with those users who actively link to you the most.”

    The social reports are certainly welcome for Google Analytics users, and any data Google can add to the mix is a good thing, especially since so much of it is now “not provided.”

    There was actually an interesting report from Poynter this week about the impact of the “not provided” data on news sites, citing Adtrak, indicating that it’s having a huge effect. Poynter revealed that 29% of its own searches in April were not provided.

    The “not provided” data, of course, comes as a result of Google’s encrypted-by-default search experience for signed in users.

    Hopefully some of the new data Google is offering will help ease the pain.

  • Google Penguin Update: A Lesson In Cloaking

    There are a number of reasons your site might have been hit by Google’s recent Penguin update (formerly known as the Webspam update). Barring any unintended penalties, the algorithm has wiped out sites engaging in webspam and black hat SEO tactics. In other words, Google has targeted any site that is violating its quality guidelines.

    One major thing you need to avoid (or in hindsight, should have avoided) is cloaking, which is basically just showing Google something different than you’re showing users. Google’s Matt Cutts did a nice, big video about cloaking last summer. He calls it the “definitive cloaking video,” so if you have any concern that you may be in the wrong on this, you’d better watch this. It’s nearly 9 minutes long, so he packs in a lot of info.

    Cloaking is “definitely high risk,” Cutts says in the video.

    With Penguin, there’s been a lot more talk about bad links costing sites. Link schemes are specifically mentioned in Cutts’ announcement of the Penguin update, and Google has been sending webmasters a lot of messages about questionable links recently. That’s definitely something you don’t want to ignore.

    But while Google didn’t mention cloaking specifically in the announcement, it did say the update “will decrease rankings for sites that we believe are violating Google’s existing quality guidelines.” Cloaking fits that bill. Google divides its quality guidelines into basic principles and specific guidelines. Cloaking appears in both sections.

    “Make pages primarily for users, not for search engines,” Google says in the Basic Principles section. “Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as ‘cloaking.’”

    In the Specific Guidelines section, Google says, “Don’t use cloaking or sneaky redirects.” This has its own page in Google’s help center. Specific examples mentioned include: serving a page of HTML text to search engines, while showing a page of images or flash to users, and serving different content to search engines than to users.

    “If your site contains elements that aren’t crawlable by search engines (such as rich media files other than Flash, JavaScript, or images), you shouldn’t provide cloaked content to search engines,” Google says in the help center. “Rather, you should consider visitors to your site who are unable to view these elements as well.”

    Google suggests using alt text that describes images for users with screen readers or images turned off in their browsers, and providing textual contents of JavaScript in a noscript tag. “Ensure that you provide the same content in both elements (for instance, provide the same text in the JavaScript as in the noscript tag),” Google notes. “Including substantially different content in the alternate element may cause Google to take action on the site.”

    Also discussed in this section of Google’s help center are sneaky JavaScript redirects and doorway pages.

  • Google Gives More Details On Human Raters


    Google has people that it pays to rate the quality of search results. They’re called raters. Google mentioned them last year in a widely publicized interview with Wired – the interview, in fact, in which the Panda update’s name was revealed.

    Given that the Panda update was all about quality, many webmasters became very interested in these raters and their role in the ranking process.

    Google talked about them a little at PubCon in November, and in December, Google’s Matt Cutts talked about them some more, saying, “Even if multiple search quality raters mark something as spam or non-relevant, that doesn’t affect a site’s rankings or throw up a flag in the url that would affect that url.”

    Cutts posted a new video about the raters today, giving some more details about how the process works.

    Very in-depth video today about how Google uses human eval rater data in search: http://t.co/9Nhn44TP Please RT!

    “Raters are really not used to influence Google’s rankings directly,” says Cutts in the video. “Suppose an engineer has a new idea. They’re thinking, oh, I can score these names differently if I reverse their order because in Hungarian and Japanese that’s the sort of thing where that can improve search quality. What you would do is we have rated a large quantity of urls, and we’ve said this is really good. This is bad. This url is spam. So there are 100s of raters who are paid to, given a url, say is this good stuff? Is this bad stuff? Is it spam? How useful is it? Those sorts of things.”

    “Is it really, really just essential, all those kinds of things,” he continues. “So once you’ve gotten all those ratings, your engineer has an idea. He says ‘OK, I’m going to change the algorithm.’ He changes the algorithm and does a test on his machine or here at the internal corporate network, and then you can run a whole bunch of different queries. And you can say OK, what results change? And you take the results that change and you take the ratings for those results and then you say overall, do the results that are returned tend to be better, right? They’re the sort of things that people rated a little bit higher rather than a little bit lower? And if so, then that’s a good sign, right? You’re on the right path.”

    “It doesn’t mean that it’s perfect, like, raters might miss some spam or raters might not notice some things, but in general you would hope that if an algorithm makes a new site come up, then that new site would tend to be higher rated than the previous site that came up,” he continues. “So imagine that everything looks good. It looks like it’s a pretty useful idea. Then the engineer, instead of just doing some internal testing, is ready to go through sort of a launch evaluation where they say how useful is this? And what they can do is they can generate what’s called a side by side. And the side by side is exactly what it sounds like. It’s a blind taste test. So over here on the left-hand side, you’d have one set of search results. And on the right-hand side you’d have a completely different set of search results.”

    Google showed the raters in a video last year, which actually showed a glimpse of the side-by-side:

    “If you’re a rater, that is a human rater, you would be presented with a query and a set of search results,” Cutts continues. “And given the query, what you do is you say, “I prefer the left side, ” or “I prefer the right side.” And ideally you give some comments like, ‘Oh, yes, number two here is spam,’ or ‘Number four here was really, really useful.’ Now, the human rater doesn’t know which side is which, which side is the old algorithm and which side is the new test algorithm. So it’s a truly blind taste test. And what you do is you take that back and you look at the stuff that tends to be rated as much better with the new algorithm or much worse with the new algorithm.”

    “Because if it’s about the same then that doesn’t give you as much information,” he says. “So you look at the outliers. And you say, ‘OK, do you tend to lose navigational home pages? Or under this query set do things get much worse?’ And then you can look at the rater comments, and you can see could they tell that things were getting better? If things looked pretty good, then we can send it out for what’s known as sort of a live experiment. And that’s basically taking a small percentage of users, and when they come to Google you give them the new search results. And then you look and you say OK, do people tend to click on the new search results a little bit more often? Do they seem to like it better according to the different ways that we try to measure that? And if they do, then that’s also a good sign. ”
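
    The side-by-side and live experiment stages lend themselves to a similar sketch. Below is a rough, hypothetical Python version of a blind side-by-side tally: the control and experiment result sets are randomly assigned to the left and right panes so the rater can’t tell which is which, and preferences are mapped back afterwards. None of these names come from Google; it is only an illustration of the mechanics Cutts describes.

        import random
        from collections import Counter

        def run_side_by_side(queries, control_results, experiment_results, rate_fn, seed=0):
            """rate_fn(query, left, right) returns 'left', 'right' or 'same' for one rater task."""
            rng = random.Random(seed)
            tally = Counter()
            for query in queries:
                flipped = rng.random() < 0.5  # hide which side holds the experiment results
                if flipped:
                    left, right = experiment_results[query], control_results[query]
                else:
                    left, right = control_results[query], experiment_results[query]
                choice = rate_fn(query, left, right)
                if choice == "same":
                    tally["same"] += 1
                elif (choice == "left") == flipped:
                    tally["experiment"] += 1
                else:
                    tally["control"] += 1
            return tally

        # A live experiment would then divert a small slice of real traffic and compare
        # behaviour per bucket (e.g. clicks per impression) before any full launch.

    As Cutts notes, the comments attached to the outliers matter at least as much as the raw counts.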

    Cutts acknowledges that the raters can get things wrong, and that they don’t always recognize spam.

  • Will The World End In 2012? Google Still Has No Official Position

    Let’s assume that Matt Cutts speaks for all of Google. Ok, assumed. And as it turns out, you still shouldn’t look to Google for any predictions involving the apocalypse.

    Back in March of 2010, the possibility of the world coming to an end in 2012 was already a buzzed-about topic. So much so that it even slipped its way into a question directed at Matt Cutts as part of the Google Webmaster Help series. James Slater asked Matt if Google thinks the world will end in 2012.

    Matt’s official response: “I believe Google has no official position on that.”

    And as we quickly approach the end of the Mayan Calendar, it appears that his response hasn’t changed. Slater followed up with Matt Cutts on Twitter earlier today, wondering if his two-year-old response was still accurate. Here’s the new response:

    @jamslater I’m sticking with my answer 🙂

    For its part, NASA does have an official position on the matter:

    Nothing bad will happen to the Earth in 2012. Our planet has been getting along just fine for more than 4 billion years, and credible scientists worldwide know of no threat associated with 2012.

    Worldwide, a recent poll showed that one in seven people think the world will end in their lifetime, and 10% linked that feeling to the whole Mayan calendar deal. As of now, Google’s lack of a position is understandable. If you start seeing a direct answer in search results like the mock-up above, then it’s probably time to panic.

  • Experimental Search Engine Removes Top Million Sites From Your Results

    Do you ever feel the search results that Google yields are too mainstream? Are you looking to explore the cavernous, cobweb-laden outer reaches of the interwebs? If you want to spend some time on some deep discovery, Million Short might be your ticket.

    Million Short’s name says it all. It’s a search engine that brings back results that are a million sites short of what you’d find in Google. You can choose to remove the top million, hundred thousand, ten thousand, and on down to just one hundred sites from your results. As the site explains:

    Million Short is an experimental web search engine (really, more of a discovery engine) that allows you to REMOVE the top million (or top 100k, 10k, 1k, 100) sites from the results set. We thought it might be somewhat interesting to see what we’d find if we just removed an entire slice of the web.

    The thinking was the same popular sites (we’re not saying popular equals irrelevant) show up again and again. Million Short makes it easy to discover sites that just don’t make it to the top of the search engine results for whatever reason (poor SEO, new site, small marketing budget, competitive keyword(s), etc.). Most people don’t look beyond page 1 when doing a search and now they don’t have to.

    For instance, let’s say that I used Million Short to search “Hipster.” Gone are results from Wikipedia, Urban Dictionary, WikiHow, KnowYourMeme, and even latfh.com (Look at that F*cking Hipster, a popular blog). What it returned instead are various sites that I didn’t see even on the 5th page of Google search results (and I didn’t dare go past that). The lone exception was HipsterHandbook, which appeared on the 1st page of both engines.
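
    Mechanically, that kind of filtering presumably amounts to little more than a domain blocklist built from a popularity ranking. Here is a minimal Python sketch mirroring the “Hipster” example above; the domain list, the helper and the sample results are all made up for illustration and have nothing to do with Million Short’s actual implementation.

        from urllib.parse import urlparse

        # Hypothetical popularity ranking: index 0 is the most popular domain.
        TOP_DOMAINS = ["wikipedia.org", "urbandictionary.com", "wikihow.com",
                       "knowyourmeme.com", "latfh.com"]

        def strip_top_sites(results, top_domains, n):
            """Remove results hosted on any of the n most popular domains."""
            blocked = set(top_domains[:n])
            kept = []
            for url in results:
                host = urlparse(url).netloc.lower()
                # Loose registered-domain match: "en.wikipedia.org" ends with ".wikipedia.org".
                if any(host == d or host.endswith("." + d) for d in blocked):
                    continue
                kept.append(url)
            return kept

        sample_results = [
            "https://en.wikipedia.org/wiki/Hipster",
            "https://www.urbandictionary.com/define.php?term=hipster",
            "http://www.hipsterhandbook.com/",
            "http://latfh.com/",
        ]
        # Only the less-popular site survives the cut.
        print(strip_top_sites(sample_results, TOP_DOMAINS, n=1_000_000))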

    In theory, Million Short is helping you discover stuff that you would never ever see using Google or even Bing or Yahoo!. It’s stuff that would be buried under hundreds of pages of search results. Let’s look at another example, a search for “The Beatles.”

    Million Short failed to remove the top search result from a Google search of “The Beatles,” which was thebeatles.com. But the results that follow are deeper sites. Million Short removed (once again) Wikipedia, last.fm, mtv.com, apple.com, amazon.com and a multitude of lyrics and guitar tab sites from my results.

    One result I stumbled upon was from a site called suckmybeatles.com, which is basically run by a guy who really thinks The Beatles blow, posting blog entries and funny pictures detailing this (unpopular) opinion. That was well worth my time, so I guess score one for Million Short.

    Million Short was brought to my attention via reddit, so let’s take a look at some of the reviews from the community (which are mixed).

    Oddgenetix writes:

    I just had a very rewarding experience with this thing. I searched my own name, and through pure serendipity the first result was an artist, with the same name as I. The art he paints is 50’s-60’s pin-up (the old-style classy kind, not the desperate new variety that melded with rockabilly, retro, and reality-tv-tattoo-culture.) Also really sweet looking vintage car ads for cars he imagined, and propaganda-type posters. Shit is so awesome. I threw money at him and got a few paintings, which I will be hanging in my living room, because consequently the paintings are signed with my name and I’m a pretty good liar.
    TL;DR I searched my own name and found a same-name artist, so I bought his work and now I’m “a painter.”

    Bullshit? Maybe. Entirely plausible for this site? Definitely.

    Gsan writes:

    This is a nice technique. It’s like searching a whole other internet.
    Edit: this is real nice. Look at the sidebar of the sites it blocked, and tell me how many of those you think had what you were looking for? For me the side sites are mostly online stores, and cheap sites like ehow.com and about.com. Good riddance. Google and Bing seem to think I want to buy everything I’m searching for and they really want me to buy it at Amazon.

    DrizztDoUrden writes:

    This is actually pretty sweet. It reminds me of the gopher days when it was nearly impossible to get exactly what you wanted, but you would learn so much more from the journey.

    But FLEABttn writes:

    It’s like a search engine designed to find things I’m not looking for!

    And nkozyra writes:

    These results were shockingly terrible.

    Look, Million Short is obviously no Google killer. It’s not even a Yahoo killer. It’s an alternative search engine for people wanting a unique search experience. If you’re looking for popular, relevant information and you want it fast, it’s probably not the way to go. If you’re looking to find some random corners of the internet, it might tickle your fancy.

    Just be prepared to find stuff like this as your top result (h/t reddit). ಠ_ಠ