Tuesday, 3rd March 2009
Wait…I’ve seen this one before. Yeah, it was on my old blog before I let my blog die. Oh yeah, I let it die again btw, but whatcha gonna do?
As this is an old post, I may or may not agree with everything I said back in ‘08.
So anyway here is a vintage post:
At first glance you would assume this post is going to be geared towards hardened Google spammers and the general “blackhat” community. Whilst it is true that this will probably appeal to SE spammers much more than your average Joe, I think it will still be an interesting read for people. Myself, I find it a fascinating thought that a computer could be writing intelligent content at hundreds of times the speed of a human.
The whole notion of a computer doing something that we can call ourselves an expert in is scary. We all (probably) accept that computers can replace humans for efficiency in a large number of unskilled jobs. Computers accept that a new battery here and there is all they are going to get for their hard work; they accept that we want something to work 24/7 without 30 minute tea breaks every 20 minutes; and well, they accept what we tell them to accept really.
Some good examples of where computers do unskilled work are: comment spam, monitoring rankings and cracking CAPTCHAs (though it is debatable whether that is unskilled work at times). The problems with trying to apply computers to tasks arise when you make that task a skilled task. Can a computer produce art? I would say no. Can a computer produce a website template? Yes. Can a computer produce an article? Yes. Can it create inspiring poetry? No. The difference between the no and the yes is creativity. The way I see it, computers right now are very primitive minds. I think the human mind runs on rules, just as a computer does. A human mind simply has a larger number of rules. Now, I’m going to stop before I ramble on about rules for a whole article. Simply, the reason computers can write an article but not create art is that to write correctly there is a set of rules that we can understand and put into code, whereas we can’t put creativity and emotion into code, because we simply do not understand them yet. Now, onwards!
Let’s explore the different approaches to content generation, how close they are to AI and how they fare in Google.
Scraping
AI Rating: 1/10 - Whilst you could argue that copying content is just like human behaviour, it’s not really AI. I’m letting it have a 1, but only because it collects the content using a keyword, so it is related content.
Effectiveness: 7/10 - Despite Google ‘fixing’ the duplicate content issues in its algorithm, you can still rank a blog based entirely on duplicate content. It’s also readable to the general public and, more importantly, makes sense and has a point to it.
My Rating: 1/10 - I can’t find much exciting to say about scraped content. It’s not particularly advanced, so I don’t feel very elite when I’m scraping.
Scrape & Shuffle
AI Rating: 1/10 - About as intelligent as just scraping. In fact, in terms of AI, it’s a step down from just scraping. You take intelligent content then crap all over the rules of language syntax.
Effectiveness: 4/10 - Will do a bit better in Google than the scraped content, but it isn’t readable to anyone so it makes zero sense.
My Rating: 1/10 - Meh. Don’t like it. It destroys perfectly good content.
Synonym Replacement
AI Rating: 4/10 - Not bad in terms of intelligence: it understands the meaning of the word and switches the word for another word meaning the same thing. It relies on thesauri to form its intelligence, so it needs an intelligent human at the back end.
Effectiveness: 7/10 - Makes sense most of the time and is unique-ish content. Problems can arise if you don’t scrape enough data and start repeating sentences with only 2 out of the 10+ words in that sentence altered. It’s likely to fly under the radar in the current Google landscape, but could see its arse get kicked right out of the index in the future.
My Rating: 6/10 - Good simple idea, makes sense to do it, but it’s not my favourite.
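To make the synonym idea concrete, here is a minimal Python sketch. The tiny thesaurus, the function names and the sample sentence are all made up for illustration; a real script would lean on a far bigger thesaurus and would need to cope with punctuation and word forms. The `changed_ratio` helper measures the problem mentioned above: how few of the words actually end up altered.

```python
import random

# Hypothetical hand-built thesaurus, for illustration only.
THESAURUS = {
    "big": ["large", "huge"],
    "quick": ["fast", "rapid"],
    "good": ["fine", "decent"],
}

def spin(text, thesaurus, rng):
    """Swap every word that has a thesaurus entry for a randomly chosen synonym."""
    out = []
    for word in text.split():
        options = thesaurus.get(word.lower())
        # Words without an entry (or with punctuation attached) pass through untouched.
        out.append(rng.choice(options) if options else word)
    return " ".join(out)

def changed_ratio(original, rewritten):
    """Fraction of word positions that differ between the two texts."""
    a, b = original.split(), rewritten.split()
    changed = sum(1 for x, y in zip(a, b) if x != y)
    return changed / len(a) if a else 0.0

rng = random.Random(0)  # seeded so repeated runs give the same rewrite
source = "a big dog made a quick and good recovery"
result = spin(source, THESAURUS, rng)
ratio = changed_ratio(source, result)
print(result)
print(f"{ratio:.0%} of the words were altered")
```

Only the three words that appear in the thesaurus can change here, so most of the sentence stays identical to the source, which is exactly why thin thesaurus coverage leaves you with near-duplicate content.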
Markov
AI Rating: 7/10 - Pretty clever idea. Uses statistics it has gathered to estimate what word is likely to follow the previous word.
Effectiveness: 8/10 - If the Markov script is created so it looks back slightly further than just one word, it could produce very good content, rivalling that of a human. Though the further you look back, the more likely you are to end up with just duplicate content, unless you train your script VERY hard.
My Rating: 7/10 - Good stuff, I really like the idea of using statistics from collected data to predict how content should be formed. When I first heard of a Markov script, I thought it just shuffled words randomly and I came up with the idea of somehow using statistics to predict which word should follow the last - then I found out what a Markov script really did. So it’s a good idea :) (If you think I made that story up to sound clever - you can GTFO).
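For anyone who hasn’t seen one, a word-level Markov script can be sketched in a few lines of Python. This is a rough illustration, not anyone’s production script: the function names, the `order` parameter (how many words it looks back) and the toy corpus are all my own assumptions. The `branching` helper shows the trade-off described above: the further back you look, the fewer choices each step has, so the output drifts towards copying the source.

```python
import random
from collections import defaultdict

def build_chain(words, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`; frequent followers are picked more often."""
    out = list(start)
    order = len(start)
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: this state was never followed by anything
            break
        out.append(rng.choice(followers))
    return out

def branching(chain):
    """Average number of distinct next words per state."""
    return sum(len(set(f)) for f in chain.values()) / len(chain)

# Toy training text, made up for the example.
corpus = ("the cat sat on the mat and the cat ran to the door "
          "and the dog sat on the rug").split()

chain1 = build_chain(corpus, order=1)
chain2 = build_chain(corpus, order=2)
text = generate(chain1, ("the",), 12, random.Random(1))
print(" ".join(text))
print(branching(chain1), branching(chain2))
```

On a corpus this small the second-order chain has noticeably fewer options per state than the first-order one, so the amount of training text has to grow with the look-back distance.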
Scratch writing to syntax
How I generate content
I use scratch writing to syntax for my content generation; it is my weapon of choice. Here is a basic outline of how I do it.
* Step 1 - Grab keywords/phrases related to the root keyword.
* Step 2 - Run the keywords/phrases through a POS (part-of-speech) tagger to figure out what type of word they are, e.g. verb, determiner, noun.
* Step 3 - Generate sentences based on a large database of keywords and sentence structures obtained from running the script through the Wikipedia database, matching sentences to phrase patterns if necessary.
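The steps above can be sketched roughly as follows. This is a toy Python illustration, not the actual script: the lexicon stands in for the output of a POS tagger, the two sentence patterns stand in for structures mined from a large corpus, and every name here (`LEXICON`, `PATTERNS`, `pattern_of`, `generate_sentence`) is hypothetical.

```python
import random

# Hypothetical result of step 2: words grouped by part-of-speech tag.
LEXICON = {
    "DET": ["the", "a"],
    "ADJ": ["large", "small", "red"],
    "NOUN": ["ball", "garden", "dog"],
    "VERB": ["chased", "found", "painted"],
}

# Hypothetical sentence structures, standing in for ones learned from a corpus.
PATTERNS = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
]

# Reverse lookup: word -> tag (assumes each word has exactly one tag).
TAG_OF = {word: tag for tag, words in LEXICON.items() for word in words}

def pattern_of(sentence):
    """Reduce a sentence to its tag sequence; this is how a structure could be harvested."""
    return [TAG_OF[word] for word in sentence.lower().split()]

def generate_sentence(lexicon, patterns, rng):
    """Pick a structure, then fill each slot with a random word of the right type."""
    pattern = rng.choice(patterns)
    words = [rng.choice(lexicon[tag]) for tag in pattern]
    return " ".join(words).capitalize() + "."

sentence = generate_sentence(LEXICON, PATTERNS, random.Random(42))
print(sentence)
print(pattern_of("the dog found a ball"))
```

Every slot is filled by word type alone, so the output is grammatical but nothing checks whether it makes sense.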
AI Rating: 9/10 - This is elite in terms of AI. It produces text based on the rules of writing a sentence, just like a human brain. It almost has an understanding of correct sentence structure and could be modified so it could actually think up its own sentence structures, just like a human brain. It has a large database of words with an understanding of how each fits in a sentence and around other words, just like a human brain. One thing it can’t do is understand the meanings of words, which can cause problems when your sentences come out like this: “The very large ball was very small”. Though on the whole, you should get pretty good content that is readable.
Effectiveness: 8/10 - When a machine generates content using basically the same process as a human, how can you tell the difference? It is very difficult. The only things that let this approach down are that it has no understanding of what words mean and that it is slower than the other methods.
My Rating: 9/10 - Love it, love it, love it. It is getting very close to AI, and it produces unique and, on the whole, readable content.
The Content Generation Future
The way I see it evolving, and the way I will be going, is a combination of Markov and scratch writing to syntax. Combining the understanding of natural language syntax and the understanding of natural word patterns could produce a very powerful content generator. If you want to better understand all the elements of natural language processing and natural language generation, Wikipedia and university websites/papers are the way to go.