AOL Releases Massive Amounts of Private Data

AOL must have been oblivious to the recent standoff between the DOJ and Google: in what seems to be a BIG mistake, it has released data on 20 million web queries from about 650,000 AOL users.

What makes the case so sensitive is that user data was released without permission and, needless to say, some downright scary data has surfaced.

I saw a few places where AOL has carefully explained the situation and apologized for the seemingly accidental error. For example, from The Paradigm Shift:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

Here was what was mistakenly released:

* Search data for roughly 658,000 anonymized users over a three month period from March to May.

* There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.

* According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.

* Roughly 20 million search records over that period, so the data included roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.

* The searches included as part of this data only included U.S. searches conducted within the AOL client software.

Our apologies again.

Andrew Weinstein
AOL Spokesperson
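
A quick sanity check on the percentages in that statement, as a minimal Python sketch. It uses only the figures quoted above; the variable names are mine, and the total search volume is merely implied by the "1/3 of one percent" claim, not a number AOL published.

```python
# Figures quoted in the AOL statement above.
released_users = 658_000
may_unique_visitors = 42_700_000   # comScore Media Metrix, May
released_records = 20_000_000

# Share of May search users covered by the release.
user_share_pct = released_users / may_unique_visitors * 100

# "1/3 of one percent" is 1 in 300, so the implied three-month total
# (an inference, not a published figure) is 300 times the release.
implied_total_searches = released_records * 300

print(f"{user_share_pct:.2f}% of May search users")
print(f"about {implied_total_searches:,} searches implied over the period")
```

The user share works out to about 1.54%, which matches the "roughly 1.5%" in the statement.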

I think from now on there will be a very close eye on AOL and I’m sure bloggers as well as newspapers won’t stop talking about this for a while.

Continue Reading August 7th, 2006

Google Launches Project Hosting

It’s common knowledge that Google likes open source software. With the search engine’s latest move to create an open source repository, it’s getting behind the open source community in a big way. For various reasons, however, it’s getting mixed reviews….

Continue Reading August 7th, 2006

Using Link Pathway Segments To Identify Potential Resources

Last week I described Link Pathways, which consist of all the documents between two documents that are connected by a chain of hypertext links.  I also explained Link Mass, which consists of all the links that, through various Link Pathways, lead to any given document.  Measuring your pages’ true Link Mass is virtually impossible, but you can measure (and work with) Relative Link Mass.

You can measure a select group of Link Pathway Segments.  A segment is a pathway of fixed, arbitrary length.  Many SEOs actually study Link Pathway Segments 1 document long when they look at who is linking to the documents that link to their pages.  If you’re analyzing a competitor’s link profile, they may potentially have 100,000 links going back 2 levels.

It’s not feasible to look at 100,000 linking documents, though.  Many of them probably include a lot of outbound links.  The more outbound links a page has, the less traffic it tends to confer to any given outbound link, although placement on the document influences which links get the most traffic.  So, while you ideally want inbound links from high traffic documents with few outbound links, you’ll rarely receive such links.  Worse, such links are often transient — they only last for a short length of time.

If a Web site gives you 6 transient links in a year, the pathway is relatively persistent.  The segment from that page to your page still exists.  A Link Pathway may change its scope over time.  Documents can drop out and join into a Link Pathway.  The environment is as dynamic as the whole Web.  So, when evaluating Link Pathway Segments you have to consider that the pathways may flow through main pages that change content frequently.  A virtually persistent pathway segment is as good as a persistent pathway segment. In fact, it can be better (because you often get archived linkage).

Link Pathway Segments tend to be less volatile than full Link Pathways.  They are also easier to measure.  You can measure which Link Pathway Segments are the most valuable for your needs by comparing their Relative Link Mass.  Relative Link Mass can be captured in various ways, but the simplest way is to identify a terminating document for a segment and count its backlinks.  Let’s do a sample comparison.

Example 1:  http://searchenginewatch.com/ links to its SearchBlog section which sometimes links to SEOMoz.  SEOMoz includes a link to my profile page.  My profile page links to my official Web site.  The main page of my official Web site links to my "articles and essays" pages, and those pages link to numerous Web sites that carry articles and/or essays I’ve written.

That Link Pathway Segment takes you from a very important, very popular resource (Search Engine Watch) to relatively obscure content, but it passes through three domains that are all fairly link popular (in descending order of popularity and importance).  Since I have written some articles about search engine marketing, you can remain on topic all the way down to some of the more obscure sites.

Example 2:  mydogharriet.blogspot.com has a post from March titled "Take this survey and save my dog, Harriet".  From that page you can get to mommybloggers.com.  From there you can get to blogher.org.  Today I can click on a post by Elena Cantor titled "Love, American Style" and I can follow a link from there to queendom.com.  There I am at a dead end.  I cannot easily find any outbound linkage.

My topic for the first Link Pathway Segment was search engine marketing.  My topic for the second Link Pathway Segment was "save my dog".  But we didn’t stay on topic.  What we can take away from this comparison is that some Link Pathway Segments wander all over the topic map.  You can get a spoof post about a non-existent survey that leads to a mother’s blog that leads to an essayist blog, etc.

The Relative Link Mass for MyDogHarriet is pretty low (divide its fewer than 500 backlinks by 25 billion).  Queendom.com’s Relative Link Mass is about 100 times greater than My Dog Harriet’s Relative Link Mass, but there is no immediately obvious outbound link.  Queendom won’t be a very good resource for non-search traffic for most sites.  Nor will many of the sites leading to Queendom provide a typical commercial site with much non-search traffic.
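
Here is a minimal Python sketch of that division, assuming the "simplest way" described earlier: count the backlinks of a segment's terminating document and divide by the size of the index. The 25 billion figure and the "fewer than 500" count come from the paragraph above; the function name and the Queendom backlink count are hypothetical, with the latter chosen only to illustrate a 100-fold gap.

```python
INDEX_SIZE = 25_000_000_000  # total documents, the figure used above

def relative_link_mass(backlinks, index_size=INDEX_SIZE):
    """Backlink count of a terminating document as a share of the index."""
    return backlinks / index_size

# "Fewer than 500" backlinks, so 500 is an upper bound.
harriet = relative_link_mass(500)

# Hypothetical count, picked so the ratio comes out at about 100x.
queendom = relative_link_mass(50_000)

print(f"MyDogHarriet: {harriet:.1e}")
print(f"Queendom is {queendom / harriet:.0f} times larger")
```

The absolute numbers are tiny, which is why only the comparison between segments is useful.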

Generally speaking, I pick 5-10 segments, selected randomly from a page’s backlinks, to build a profile for that page’s audience.  Building an audience profile from Link Pathway Segments tells you where non-search traffic comes from.  While it doesn’t matter how deep you go, I like to look 3-4 pages deep.
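
That sampling routine can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `backlinks` dictionary (document to the list of documents linking to it) is a hypothetical stand-in for whatever backlink data you have collected, and every site name in it is made up.

```python
import random

def sample_segments(backlinks, page, k=5, depth=3, seed=None):
    """Pick up to k random backlinks of `page` and walk each one backward
    up to `depth` documents, returning paths that end at `page`."""
    rng = random.Random(seed)
    sources = backlinks.get(page, [])
    starts = rng.sample(sources, min(k, len(sources)))
    segments = []
    for start in starts:
        path = [page, start]
        # Keep stepping back until the segment is `depth` documents deep
        # or the current document has no unvisited backlinks.
        while len(path) - 1 < depth:
            candidates = [d for d in backlinks.get(path[-1], []) if d not in path]
            if not candidates:
                break
            path.append(rng.choice(candidates))
        segments.append(list(reversed(path)))
    return segments

# Hypothetical backlink data: document -> documents that link to it.
backlinks = {
    "mysite": ["blogA", "blogB", "dirC"],
    "blogA": ["portalX"],
    "blogB": ["portalX", "blogA"],
    "portalX": ["hubY"],
}

segments = sample_segments(backlinks, "mysite", k=2, depth=3, seed=1)
for segment in segments:
    print(" -> ".join(segment))
```

Each printed line reads from the most distant document down to your page, which is the direction a visitor would travel.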

For non-search traffic, you want links from high traffic pathways with few outbound links.  Most personal blogs are poor sources for those kinds of links.  To state the obvious, for a link building campaign, you need to identify commercially oriented blogs that focus on the general topic.  Mommies are wonderful people, but they tend to wander all over the Web with their linkage.  Even so, getting commercial blog linkage is still challenging in more than one way.

I’ll leave you with one last example.  http://googleblog.blogspot.com/ is a highly visible commercial blog.  The "What we’re reading" section, where they link out to non-Google sources, provides 32 links. I’m pretty sure that you can get to SEOMoz from about 5 of those links.  You have about a 1-in-6 chance of landing on a site that may provide a transient link to SEOMoz.

They list nearly 40 blogs by Googlers.  I’ve only seen SEOMoz mentioned once on one of those blogs. So if you take all external links into consideration, the odds of landing on a page that links to SEOMoz drop to less than 1-in-10.  These are all persistent links.
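
The odds quoted above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch that assumes a visitor clicks one external link uniformly at random, which real visitors do not do; the link counts are the ones given in the text, with "nearly 40" rounded to 40.

```python
from fractions import Fraction

reading_links = 32   # links in the "What we're reading" section
reading_hits = 5     # links believed to lead to a site that reaches SEOMoz
googler_blogs = 40   # "nearly 40" in the text; rounded up as an assumption
googler_hits = 1     # one Googler blog seen mentioning SEOMoz

p_reading = Fraction(reading_hits, reading_links)
p_all = Fraction(reading_hits + googler_hits, reading_links + googler_blogs)

print(f"reading list only: 1-in-{float(1 / p_reading):.1f}")
print(f"all external links: 1-in-{float(1 / p_all):.1f}")
```

Under those assumptions the reading list alone gives 1-in-6.4 and the full set of external links gives 1-in-12, in line with the "about 1-in-6" and "less than 1-in-10" figures.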

So, assuming you get to one of those sites, and that you manage to follow a link from there to SEOMoz, there are more than 30 links in the margin navigation alone, of which one leads to my SEOMoz profile, which links to my site.  The good news is that you can get to michael-martinez.com from Google’s blog in about 4 or 5 hops.  The bad news is that the probability of anyone making those exact hops is pretty small — although my referral data indicates I get quite a bit of traffic from SEOMoz.

There is relatively little flexibly targeted focus from the Google Blog and Search Engine Watch for a search marketing guru.  Google Blog is reluctant to link out to many people; Search Engine Watch links out to tons of people.  I get traffic from SEOMoz mainly because I’m a contributor with persistent linkage (I also get referrals from Rand’s search ranking factors article).

For non-search traffic, choosing your links differs from building search engine linkage because some links just won’t help.  Evaluating Link Pathway Segments tells you what audience is interested in a potential linking source, which potential sources have high visibility (higher Relative Link Mass equals higher visibility), and which potential linking sources offer strong, flexible focus in your targeted vertical.  When I resume the SEO Strategies Series, I’ll show how to use your Link Pathway Segment research to guide linking choices.

Continue Reading August 7th, 2006

A Completely Irrelevant, Off-Topic Post (that should brighten your day)

Everyone who’s seen this short Flash video has been quoting pieces to me non-stop. I figured it was worth spreading the virus:

David's New Snail

The video is courtesy of Mike Adair - who clearly has his priorities in order. Bonus points to anyone who spots me in San Jose and tells me to buy things at "the other store."

Continue Reading August 5th, 2006

Google Webmaster Central - An In-Depth Review

Poor Vanessa Fox, working late on a Friday night. I felt so bad for her, I just had to post this timely review of Google’s new Webmaster Central, which is replacing sitemaps (and including its old functionality). There’s a lot of material to cover, so let’s dive in:

Google Webmaster Central Home

Above, we can see the start page for Google Webmaster Central once you’ve verified a site (this can be done by uploading an HTML file, or by editing a meta tag on the site). Once I’m in, I see both www.seomoz.org and seomoz.org are listed as verified, and there are additional tools, including report spam and submit a re-inclusion request. Almost ironic in its dichotomy, eh?

 Google Webmasters Central Summary

Next, we’ve got the indexing summary page, showing me the variety of problems that Google’s had when crawling the SEOmoz.org site. They give you the last crawl date and let me know that I should consider submitting a sitemap and reviewing the files they’re having trouble accessing.

HTTP errors in Google Webmasters Central

When we dig a bit deeper, we can see exactly which pages they had trouble crawling, and the errors they received. It looks like Kat’s link ninjas photos from earlier this year were disabled, though I’m having no trouble seeing the branding page that’s listed first. I like that they also show you when they tried to reach the problematic page.

After the web crawl, we can see things like the robots.txt analysis and managing site verification - good features, but basic enough that they don’t need in-depth coverage. However, the preferred domain feature is really groovy.

Google Webmasters Central Preferred Domain

Note that I can select to display either SEOmoz.org or www.SEOmoz.org in the SERPs. Hopefully, this also helps with link canonicalization and duplicate content issues that could potentially hurt sites that don’t use the 301 re-direct to send url.com to www.url.com (or the reverse).
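
For anyone who hasn’t set up that 301 re-direct, the decision logic looks roughly like this in Python. It is a sketch of the idea only, not how Google or any particular web server implements it: the function names are hypothetical, it only handles the www versus non-www case, and it drops any port number from the URL.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_www(host):
    """Return the host without a leading 'www.' label."""
    return host[4:] if host.startswith("www.") else host

def canonical_redirect(url, preferred_host):
    """Return (301, new_url) if `url` uses the non-preferred variant of
    the host, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == preferred_host:
        return None  # already canonical
    if strip_www(host) != strip_www(preferred_host):
        return None  # a different site entirely; leave it alone
    return 301, urlunsplit(parts._replace(netloc=preferred_host))

print(canonical_redirect("http://url.com/page?x=1", "www.url.com"))
print(canonical_redirect("http://www.url.com/", "www.url.com"))
```

The first call yields a 301 to the www version with the path and query string preserved; the second returns None because the request is already on the preferred host.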

After I clicked "preferred domain," I saw a new link on the menu called "crawl rate," but clicking it just resulted in an error:

Google Webmasters Central Crawlrate

Those hard-working Googlers just have too much to do…

Google Webmasters Central Query Stats

Query stats is one of my favorite features - it’s not as in-depth as a good analytics program, but it does show ranking positions, which is very nifty. They also seem relatively accurate, though I saw a few where I can’t imagine I ever ranked well for them - e.g. "no access." Additionally, the functionality of monitoring queries from different geographic locations is useful, since it can help you understand your geotargeted rankings and focus in Google’s eyes.

Google Webmasters Central Crawl Stats

Crawl stats are also helpful, though I’m in doubt about why the blog would be the highest PageRank page. My guess would have been the beginner’s guide, as it’s so well linked to off the site, and so popular in what folks are visiting. It’s no surprise that the vast majority of pages are very low PR, since there are 1200+ blog entries with only a few inbounds each.

Google Webmasters Central Page Analysis

Page analysis is interesting since I can see both the types of pages Google sees on the site, and the semantics of what they think the site is about. There’s what I most commonly say on the site AND what others are most commonly using to describe me in their anchor text - damn cool.

Google Webmasters Central Add Sitemap Page

The add a sitemap page is exactly what it sounds like. You can send Google an XML file and they’ll use it to crawl the pages on your site they may not have indexed (or may not be crawling frequently enough for your tastes). I’ll try to get feedback on how well this is working next time we’re part of a big launch.
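
If you have never built one of these XML files, here is a minimal Python sketch using only the standard library. The namespace is the one from the sitemaps.org 0.9 protocol, which is an assumption on my part about what you would submit (the schema version Google expects may differ), and the example.com URLs and dates are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org 0.9 protocol (assumed here).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a sitemap XML string from (loc, lastmod) pairs.
    `lastmod` may be None to omit the element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder pages, not real URLs from any site mentioned here.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2006-08-05"),
    ("http://www.example.com/blog/", None),
])
print(sitemap_xml)
```

Save the output as a file on your site and point the add-a-sitemap page at its URL.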

Google Webmasters Central Index Stats

 Finally, we’ve got the index stats, which just link out to the searches - not particularly valuable for savvy SEOs, but certainly interesting to those who don’t know how to use these queries. I think that a pop-up question mark describing what each query is showing you would probably add a bit more value for those who are seeking to understand what they’re searching.

All in all, an amazing system - robust, professional and useful. Great work Vanessa & Amanda (and all of the Webmaster Central team). Your new moniker is a bit lengthy for my tastes, but it’s certainly a more accurate descriptor of what the "sitemaps" part of Google can do.

What do you think? Will you be signing up your sites for inclusion?

Continue Reading August 5th, 2006

Massive Shakeups in the Corporate SEO World

The big names in SEO are taking the week before SES San Jose as an opportunity to make drastic changes in their lives. Here’s the rundown:

And, on a personal note, Gary Price is gettin’ hitched!

It must be an unwritten rule that big news should come just before a big show. If you run an SEO firm, disconnect all communications devices to ensure that your important employees don’t make a run for it.

Continue Reading August 4th, 2006

SEO is One of the Hot, Up & Coming Jobs

Thanks to Ryan McCann, who pointed me to this article from MSN Careers - Four Jobs on the Cutting Edge. From the article:

Search engine optimizers (SEOs) increase a firm’s Web site traffic by improving its search-engine page rankings. This is an especially important task in today’s Internet-driven world, where many customers first learn of an organization and its products or services through the Web. Because of a shortage of experts in this relatively new area, many top SEOs receive multiple job offers.

SEOs typically supplement their knowledge of how various search engines operate and determine page rankings with strong marketing skills, as well as the ability to communicate effectively and program using HTML. Most are self-taught, learning the trade by researching trends, attending conferences and seminars, participating in discussion forums, and experimenting with their own sites. Courses and certifications in this specialty are being offered by an increasing number of organizations; however, consensus on the value of these programs does not yet exist.

This looks like a remarkably accurate description of the profession, actually. I’m quite impressed with Robert Half, International (who authored the piece).

As far as SEO for a job choice, I know I wouldn’t have it any other way. Thanks to the people who blazed the trail in our industry, this is one of the most fun, engaging, dynamic and profitable professions I’ve ever seen, heard or read about. And, I gotta tell you, Intellectual Property Litigator sounds like one of the most despicable vocations you can have - it may be in demand, but it sounds like going to work for the villains of the tech world.

Do you think the secret’s almost out? Will universities start to offer classes? Are we in for a huge influx of CareerBuilder readers gone wild?

Continue Reading August 4th, 2006

Levels of Search Marketing Knowledge

I’m stealing this idea straight from these three guys (1, 2, 3) and adapting it to the world of SEO/M. Here’s my take on what the various levels of knowledge are in the search microcosm - I’ve tried to provide examples as well:

Level 1: Confused
New website owners or developers who think to themselves… "Hey, don’t I need meta tags to get Google to like me? And maybe I should buy that $19.99 submission service to the top 40 search engines - after all, if it ranks #1 at Google, it must be the best advice around!" fit this category quite well.

Level 2: Learning
These are the folks you see attending their first SES show, posting and reading at the SEO forums and generally taking a deeper stab at the world of search marketing. Some of these people have even bought AdWords or Yahoo! Search Marketing ads, but haven’t played with it much.

Level 3: Novice
If you’ve read the beginner’s guide, you’re automatically at this level (maybe even level 4 if you paid close attention). These people know basics like clean URLs, internal linking and title tags are critical to rankings. They also have a basic understanding that links influence rank, though they may still be stuck on PageRank in the toolbar as a metric.

Level 4: SEO Newbie
At level 4, you gain the title of "SEO" - you know why the meta description tag matters, why some links are better than others and how to create a search-friendly site from the ground up. You may even have a few tricks up your sleeve and have some deeper knowledge about a particular niche  - where the good links come from, why certain sites rank where they do, etc.

Level 5: SEO Professional
At level 5, you know enough to provide professional help to others. 301 re-directs are a new best friend and you’ve done link building work to a degree that makes you competent in nearly any niche. Keyword research is now a basic task and fixing up search-unfriendly features like select boxes and multiple URL parameters is old hat.

Level 6: Master SEO
Sixes are in a unique club. Chances are that to have this much knowledge, you read 10-20 SEO blogs each day. You’re so plugged in to the search world that you’ve got a fair idea of what Google’s last update affected in the algo and what types of spam are still effective at each engine. You’re a whiz at diagnosing penalties, finding solutions to difficult search problems and quickly dominating the SERPs in less competitive sectors.

Level 7: Dark Lord of Search
I would assign this level to a select few in the world of SEO. At this level, you’re often someone who’s well-known in the SEO community - you might run a blog or simply be a "big time" name in the field, regularly speaking at conferences and getting pegged by Fortune 500 brands to provide insight. These guys & gals know exactly where to place links or create pages to help overrun a negative piece of press and have the connections to make nearly any type of campaign possible, assuming the motivation & money exists.

Level 8: Johns, Boser & Baillie
I’ve only met three people I’d put at level 8 - Ammon Johns, Greg Boser & Jake Baillie. These guys know not only search, but coding, marketing & business like the back of their hand. Show them a problem and they’ll come up with a solution that solves ten issues you didn’t even know you had. Show them a sector and chances are they’ve optimized at least a few sites to top rankings in it (assuming it’s profitable). From spam to technical analysis to scraping, re-direction, content strategies and analytics, they are nothing less than the very best our field has to offer.

So, where do you rank yourself?

p.s. I would say that a ranking chart like this isn’t as relevant in our field as in others, as great levels of specialization in a particular sector must be weighed against the brilliant generalists, but it makes for a fun post, eh?

Continue Reading August 3rd, 2006

Matt Cutts: Part 4

Day 4. I can hear the walls. They all sound like Matt Cutts. I know every language family from around the world. My keyboard mocks me. Sweet merciful weekend, will you never come? Can’t stop…can’t stop transcribing…


Matt proudly displays his werewolf vs. unicorn shirt, which depicts the silhouettes of a werewolf and a unicorn mid-battle in front of a full moon. Underneath the image is the caption “It’s on now”. Matt says, “We’ve got the unicorn and the werewolf. Mortal enemies since the beginning of time, and it’s on now!”

1.  For all the data center watchers out there, should all the results across one Class C IP address block be the same most of the time, except when you’re pushing data, or are they supposed to be different because you’re trying different things on them? And would it make more sense to use the direct IP addresses when reporting issues or problems, or the 41 gfe data center names?

Let’s talk about datacenters. Back in the days of the dinosaurs, (here Matt does either an impression of a dinosaur, complete with screeching and clawing, or he just watched some Japanese anime and had an epileptic seizure), when the dinosaurs roamed the earth, you could actually run a search engine off of one computer. Those days are long since gone, unless you have a really, really powerful computer or something very, very small to search over, or you have a Google search appliance, I guess.

These days, you pretty much have to have a data center. In the early days of data centers you could just do some sort of "round robin" trick with DNS so that you would always hit different data centers. Google does some very smart stuff with load balancing, some very interesting techniques to try to make sure that different data centers are able to perform well.
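
For readers unfamiliar with the "round robin" trick Matt mentions, the basic idea can be sketched in a few lines of Python. This is a toy illustration of simple rotation only, not Google's load balancing; the addresses are made-up values from a documentation-only IP range.

```python
from itertools import cycle

# Hypothetical data center addresses (documentation-only IP range).
DATA_CENTERS = ["192.0.2.10", "192.0.2.20", "192.0.2.30"]

def round_robin(addresses):
    """Return an endless iterator that hands out addresses in rotation."""
    return cycle(addresses)

rotation = round_robin(DATA_CENTERS)
first_four = [next(rotation) for _ in range(4)]
print(first_four)
```

Each lookup gets the next address in the list, and the fourth lookup wraps back around to the first data center. Nothing here accounts for load or failures, which is where the smarter techniques come in.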

So, your basic question was this: Should all things on the same Class C IP block be roughly the same? And yes, they should be roughly the same in that they’re typically the same data center, but not always. So, let me give you a couple examples. If one data center has to fail over, or if one data center is out of the rotation, then even if you’re going to one IP address, you can get bounced over to a different data center. Even if it looks like you’re consistently hitting the same data center, behind the scenes, underneath Google’s load balancing, you could be hitting a different data center completely. Those situations are somewhat rare, but not that rare. So that’s sometimes when you see people having debates online on Webmaster World, Data Center Watch, stuff like that, they can actually be seeing different things even if they hit the same IP address.

The other point I wanted to make, and I made this at PubCon Boston, was that the data centers often have a lot of different things. So, whenever there’s a new algorithm update or some other feature that we are trying out, we’ll often try it on one data center first to make sure the quality is what we expect it to be based on evaluation, stuff like that. The data centers do differ according to some very complex, intricate plans so we can try out different things at different data centers. Typically, on one class C IP address you will usually hit the same data center, but that’s not guaranteed.

Also, at PubCon Boston I showed an example of the sorts of things that are going on in different data centers. It sort of shows how things are a lot more intricate now than they used to be. So, Google does a lot smarter scheduling, and it’s a lot harder for a random person to just look at a data center and reverse engineer or try to guess which way things are going, stuff like that.

As far as which IP address versus the gfe name, which, I think exactly me and g1smd know about, nobody else has really bothered to talk about it that much, except maybe on Webmaster World, you can use either the IP address or the two letter code of a data center because we’re able to map them both back. So, if you tell us one we can tell what the other one is, either way.

In general though, there are probably better ways to spend your time than watching data centers. I think it’s a good use of your time to work on your content, a good use of your time whenever there’s something major going on, if you really want to look whenever there’s a page rank update or something going on, but in general, there’s enough stuff going on in different data centers that I would say it’s probably not worth checking every single data center every single day. Trying to figure out "Okay, how am I going to do, or how have I been doing", it’s probably better to spend a little more time paying attention to your logs, and work backwards based off of that. 

2.  Is it possible to search for just home pages? I tried doing "-inurl:html" and "-inurl:htm", blah blah blah, PHP, ASP, but that doesn’t filter out enough.

That’s a really good suggestion, Peter. I hadn’t thought about that. Fast(?) used to offer something like that, but I think all they did was look for a tilde in the URL. I will file that as a feature request, and see if people are willing to prioritize it where they might be able to offer that. My guess is it would be relatively low on the priority list because of all the syntax you mentioned, subtracting off a bunch of extensions, but would probably work pretty well.

3. Clarification

Ah, I get to clarify something about strong vs. bold and emphasize vs. italic. So, there was a previous question where somebody asked about whether it was better to use bold or whether it was better to use strong because bold is what everybody used in the old days when the dinosaurs roamed the earth, and strong is what the W3C recommends, and at that time last night, I thought that we just barely barely barely, in epsilon, preferred bold over strong, and I said "For the most part, don’t worry about it."

The nice thing is an engineer actually took me to the code where I could actually see it for myself, and Google does treat bold and strong with exactly the same weight. So, thank you for that, Paul, I really appreciate it. In addition, I checked, he also found code that shows that em, as in emphasize, and italic are treated exactly the same as well. So, there you have it. Go forth and mark up like the W3C would like you to do. Do it, do it semantically well, and don’t worry so much about krufty(?) old tags because Google will score it the same either way.

4.  Will we see more kitty posts in the future?

I think we will; in fact, I tried to get my cats in on this shot and to sit still, but they were a little scared of the lights, so we’ll see if I can get them used to it.

5.  What are Google SSD, Google Gas, Google RS2, Google Mobile Marketplace, Google Weaver, and other services discovered by Tony Ruscoe?

I think it was very clever of Tony to try to do a dictionary attack against our services check in, but I’m not going to talk about what those services are.

6.  [What might be some of the topics] in the Duplicate Content Session of SES?

I gave a little bit of a preview in one of the other sessions, on video, but I think what we will basically talk about, Jerry will be there, a lot of other people will be there, and we will talk about shingling. What I will essentially say is that Google does a lot of duplicate detection from the crawl, all the way down to the very last millisecond, practically when a user sees things. We do stuff that’s exact duplicate detection, and we do stuff that is near duplicate detection. We do a pretty good job all the way along the line of trying to beat out dupes and stuff like that.

The best advice I’d give is to make sure that your pages that will have nearly the same content look as different as possible. If they really are truly different content, a lot of people worry about printable versions, or somebody else asked about a .doc word file compared to an html file. Typically, you don’t need to worry about that. If you have similar content on different domains, maybe in French and another version in English, you really don’t need to worry about that.

Then again, if you do have the exact same content, maybe for a Canadian site, and for a .com site, it’s probably just the sort of thing where we’ll detect whichever one looks better to us and just show that, but it wouldn’t necessarily trigger any sort of penalty or anything. If you want to avoid it, you can try to make sure that your templates are very, very different. But, in general, if the content is quite similar it’s better just to let us show whichever representation we think is the best, anyway.
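
Since shingling came up in that answer, here is a textbook-style Python sketch of the general technique: break each page into overlapping word sequences ("shingles") and compare the two sets. This is not Google's implementation, which Matt did not describe; the shingle width, the Jaccard similarity measure and the sample sentences are all my own assumptions for illustration.

```python
def shingles(text, w=3):
    """Return the set of overlapping w-word sequences in `text`."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Similarity of two shingle sets: shared shingles over all shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = shingles("the quick brown fox jumps")
page_b = shingles("the quick brown fox sleeps")

similarity = jaccard(page_a, page_b)
print(f"similarity: {similarity:.2f}")
```

The two sample sentences share two of four distinct shingles, so the score is 0.50. Exact duplicates score 1.0, and pages with no word sequences in common score 0.0.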


7.  Does Google index or rank blog sites differently than regular websites?

Not really. Somebody else asked about links from .govs and .edus, and whether links from 2 level deep .govs and .edus, like .gov.pl, were worth the same as .gov, and the fact is we really don’t have much in the way to say, "Oh, this is a link from the ODP, or .gov or .edu, so give that some sort of special boost." It’s just that those sites tend to have higher Page Rank because more people link to them and reputable people link to them. So, blog sites, there’s not really any distinction, unless you go off the blog search, of course, and then it’s all constrained to blogs. In theory, we could rank them differently, but, for the most part, just the general search, the way it crawls out ends up working out okay.

Continue Reading August 3rd, 2006

Branding Your Site: Free Yourself from SERPs!

Is this topic necessary? After all, it is the Internet: just put up a site, offer content, optimize for search engines and voila! You have hits. Well, maybe you have hits and maybe you don’t….

Continue Reading August 3rd, 2006
