Filed under: Uncategorized
You probably don’t even realize that every time you solve a two-word “CAPTCHA” (the distorted word images created to prevent spam), you are actually helping the world to digitize, and thus preserve, old, crumbling books.
Ahhhh, but so much is always going on under the surface! Read on and keep on CAPTCHA-ing!
Typing in two words correctly results in the digitization of one word
By Paul Rubens
A weapon used to fight spammers is now helping university researchers preserve old books and manuscripts.
Many websites use an automated test to tell computers and humans apart when signing up to an account or logging in.
The test consists of typing in a few random letters in an image and is designed to fight spammers.
Carnegie Mellon is using this test to help decipher words in books that machines cannot read by letting sites use them to authenticate log-ins.
The test, known as a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help keep out automated programs known as “bots.”
Spam messages
Bots are designed by spammers to post advertisements in discussion forums or to sign up for large numbers of e-mail addresses which are later used to send spam messages.
A CAPTCHA consists of an image containing letters or numbers which have been heavily distorted, making it hard or impossible for a bot to “read.”
By requiring web site visitors to type in the contents of the CAPTCHA before being allowed in to the site, humans can be admitted while all but the smartest bots are rebuffed.
CAPTCHAs are unpopular with many Internet users because the words they contain are often so heavily distorted to foil bots that many humans struggle to read them.
This means potential visitors’ time is wasted while they make repeated attempts to decipher the CAPTCHA they are presented with.
But the CMU research team, based in Pittsburgh, Pennsylvania, has devised an ingenious system to put the time used interpreting CAPTCHAs to good use.
Text files
The team is involved in digitising old books and manuscripts supplied by a non-profit organisation called the Internet Archive, and uses Optical Character Recognition (OCR) software to examine scanned images of texts and turn them into digital text files which can be stored and searched by computers.
But the OCR software is unable to read about one in 10 words, due to the poor quality of the original documents.
The only reliable way to decode them is for a human to examine them individually – a mammoth task since CMU processes thousands of pages of text every month.
To solve this problem the team takes images of the words which the OCR software can’t read, and uses them as CAPTCHAs.
These CAPTCHAs, known as reCAPTCHAs, are then distributed to websites around the world to be used in place of conventional CAPTCHAs.
When visitors decipher the reCAPTCHAs to gain access to the web site, the answers – the results of humans examining the images – are sent back to CMU.
Every time an Internet user deciphers a reCAPTCHA, another word from an old book or manuscript is digitised.
Deciphered correctly
To ensure that the reCAPTCHAs are deciphered correctly, website visitors are actually presented with images of two words to examine, the contents of one of which is already known.
“If a person types the correct answer to the one we already know, we have confidence that they will give the correct answer to the other,” says Luis von Ahn, a Professor at CMU.
“We send the same unknown words to two different people, and if they both provide the same answer then effectively we can be sure that it is correct.
If they don’t agree then we send it to a lot more people to examine.”
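The checking scheme von Ahn describes can be sketched in a few lines of Python. This is only an illustration, not CMU's actual code: the function names, the case-insensitive comparison, and the rule that a word is accepted once two trusted answers match are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional


def control_passed(typed_control: str, known_control: str) -> bool:
    """Trust a visitor only if they typed the already-known word correctly.

    Ignoring case and surrounding spaces is an assumption for this sketch.
    """
    return typed_control.strip().lower() == known_control.strip().lower()


def resolve_unknown(trusted_answers: List[str], min_agree: int = 2) -> Optional[str]:
    """Return the agreed reading of the unknown word, or None if more
    people still need to look at it.

    `min_agree` (hypothetical) is how many matching answers count as agreement.
    """
    counts = Counter(a.strip().lower() for a in trusted_answers)
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    return word if votes >= min_agree else None


# Two visitors agree, so the word is accepted.
agreed = resolve_unknown(["morning", "morning"])
# Two visitors disagree, so the word goes out to more people.
pending = resolve_unknown(["morning", "moming"])
```

Only answers from visitors who pass `control_passed` would be fed into `resolve_unknown`. When it returns `None`, the same image is simply shown to more people until enough of them agree.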
Thanks to the adoption of reCAPTCHAs by popular websites like Facebook, Twitter and StumbleUpon, the system is helping to decipher about one million words every day for CMU’s book archiving project, according to von Ahn.
Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU’s computers find illegible.
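The “almost three thousand man hours” figure is easy to check. Here is a quick back-of-the-envelope calculation using only the two numbers quoted in the article (the variable names are mine):

```python
# Figures quoted in the article
words_per_day = 1_000_000   # words deciphered daily, per von Ahn
seconds_per_word = 10       # approximate time to solve one reCAPTCHA

seconds_per_day = words_per_day * seconds_per_word
man_hours_per_day = seconds_per_day / 3600  # 3600 seconds in an hour

print(round(man_hours_per_day))  # 2778
```

That is roughly 2,778 hours of human effort per day, which matches the article's “almost three thousand.”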
A handy extra benefit of this system is that reCAPTCHAs are particularly good at foiling bots while remaining legible to people.
“Firstly, we are starting with words that we know our computers can’t read,” says von Ahn. “These words have also been distorted naturally over time, and the number of ways they have been distorted is very large.
‘Distorted further’
“The more ways they are distorted, the harder it is for spammers to write software which can read them.”
To make it even harder for bots, these words are then distorted further.
“What we do is the equivalent of placing the image on a rubber sheet and pulling it to distort the geometry,” he says.
Using the reCAPTCHA system von Ahn’s team is digitising documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.
“There’s no danger of us running out of words,” says von Ahn. “There’s still about 100 million books to be digitised, which at the current rate will take us about 400 years to complete.”
Story from BBC NEWS: http://news.bbc.co.uk/2/hi/technology/7023627.stm
Filed under: Uncategorized
#1 Direct Selling SEO
#1 Direct Selling Search Engine Optimization
#2 Direct Selling Web Development
#3 Direct Selling Web Design
Now that I am also serving small businesses on the Oregon coast I am targeting new keywords. That work is in progress but I have noticed that this blog is already showing up for:
Manzanita Oregon SEO
I’ll post more search rankings as well as info on what is working and what isn’t in a few weeks.
Filed under: Uncategorized
Community members and friends often contact me with their general computer questions. I’m sure this happens to every web professional: neighbors who call when their email isn’t working, friends who call when their computer dies, clients who email with word processing questions. I thought about charging for questions, but generally I prefer to spread goodwill (and good karma) and help out those who are less fortunate, so to speak.
I decided to start posting the most popular questions and answers to my blog, though. Here is the first in my series:
How to change the default font in Microsoft Word
Change the font style, size and color that shows up in every new MS Word document you create. (This works for Word for Macs, too.)
- In the menu bar choose “Format” and then “Font.”
- Select the font and size you want, then look for the button at the bottom left of the Font window that says “Default.”
- Click the “Default” button.
You don’t have to save the document but every new doc you open will magically use the new font for paragraph text.
Filed under: Uncategorized
Geesh…
About time. I have had to warn clients about putting their sites and navigation in Flash, as it totally kills their search engine rankings. The search engines simply cannot access the content contained in the Flash files, and they cannot follow links to the pages in the animated navigation systems.
This may help, although I still think that following simple, proven best practices is a good credo.
Personally, I use hard data to determine whether these things are working. I will do the same for Flash items on my client sites. The data will tell if this is, indeed, true…
So, here is the article in the news:
Uncloaking ‘invisible’ Flash Web content
Adobe announced late Monday night that it was providing optimized Adobe Flash Player technology to Google and Yahoo to help them better index dynamic Web content and rich Internet applications that include the Shockwave Flash file (SWF) format.
It sounds exciting, but what exactly does it mean for Web searchers, Webmasters, and Flash creators? CNET News.com asked Adobe, Google, and Yahoo and got some answers.
Q: What is Adobe doing?
A: Adobe is providing Google and Yahoo with optimized Adobe Flash Player technology so that their search engine spiders will be able to find and index SWF content, including Flash “gadgets” such as buttons or menus and self-contained Flash Web sites.
Q: How does this work?
A: When a search engine spider hits a normal HTML page and encounters Flash content it will load it in an optimized Flash player on the search engine server. Google has developed an algorithm that explores Flash files in the same way a person would, such as by clicking on buttons and entering input. The algorithm then indexes all the text it encounters through the navigation.
Q: How will the search experience change as a result?
A: The text that people see when they interact with Flash files, such as captions and introductions, will now be used when Google generates a snippet that appears below the URL on the search results page. The words that appear in the Flash files can now be used to match query terms in Google searches. In addition, the URLs that appear in Flash files will be fed into Google’s crawling system and be indexed.
Overall, more content will be indexed and search engine result rankings will change to reflect the additional content and its relevance. The snippets will give better information about the page on the search results. You can also expect search engine optimizers to figure out ways to improve rankings of Flash-based Web sites just like they do with HTML-based sites.
Q: Why is this necessary?
A: More than 98 percent of Internet-connected desktops have Flash Player installed, and Flash is hugely popular. Until now, the search engines were able to index some static text and links within SWF files, but much of the content was not getting indexed because of the dynamic aspect of the rich media files. Currently, all that content is essentially invisible to the search engines, and the small amount of content that does get indexed appears on the search results page as jumbled words and code that are of no use to the Web searcher.
“Now, you are losing all the context of what content was near each other and running at the same time,” says Justin Everett-Church, a senior product manager for Adobe Flash Player. He likened the impact to the difference between reading the index of a book and reading the contents of the book.
Filed under: Uncategorized
Did you know that you can check whether search engines are indexing all of the pages on your website (or any other website)? For my own site I’d enter site:dawnshears.com as the search string. (Just replace the URL with any other one, and leave out the www.)
It will return all of the indexed pages of that site.
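If you check a lot of sites, a tiny script can build the search string for you. Below is a minimal Python sketch, assuming you want a Google search link. The helper names `site_query` and `search_url` are made up for this example, and the code only builds the link; it does not fetch or count results.

```python
from urllib.parse import quote_plus, urlparse


def site_query(url: str) -> str:
    """Turn a URL or bare domain into a site: search string without the www."""
    # urlparse only recognizes the host when the input contains "//"
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return "site:" + host


def search_url(url: str) -> str:
    """Build a Google search link for the site: query."""
    return "https://www.google.com/search?q=" + quote_plus(site_query(url))


print(site_query("http://www.dawnshears.com/blog"))  # site:dawnshears.com
print(search_url("dawnshears.com"))
```

Paste the printed link into a browser and you get the same results as typing the site: string into the search box by hand.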
If not all of your site’s pages are being indexed, that could be a sign that something is wrong and the ‘bots and spiders are not able to “see” all of your pages. For instance, when I see that only the home page of a site is being indexed and no other pages, it can mean a few things. Perhaps the site is new; sometimes it takes a while for all pages to get indexed. If the site isn’t new, it usually means there is a structural problem: the site or its navigation is built in Flash, the site uses frames, or there is some other issue. Nine times out of ten it’s an issue with Flash or frames. While it’s preferable to build sites (like I do) from the bottom up with search engines in mind, often the site is already in place and it’s not practical to completely redesign it. There are ways to “fix” an existing site to overcome these issues. I’ll address that more in another post.
Filed under: Uncategorized
Wikia: brought to you by the founder of Wikipedia
Wikia, the company co-founded by Wikipedia founder Jimmy Wales, has launched its own search engine today. Although it won’t be a very robust resource for quite some time, Wikia is set up much differently than search giants Google, Yahoo and MSN. Just as on Wikipedia, volunteer collaborators will rank and recommend websites and also exclude anything that looks “spammy” or deceitful. In this way the collaborators become an active part of the Wikia ranking algorithm. Wales anticipates that Wikia will return more valid, better quality search results than other large engines.
This is a pretty interesting concept. It makes me think of the DMOZ (Open Directory Project) search directories (Alta Vista was powered by the DMOZ) that use people, not spiders or ‘bots, to categorize and rank material. I just know it took a long time to get sites accepted and indexed. By default, ranking systems like these favor sites that have longevity.
Will this change the way I do search engine optimization, going forward? Not really. I’ll keep an eye on Wikia and be sure to take that engine into consideration. Perhaps, I will start submitting sites to Wikia for consideration to test drive it. The same good search engine optimization practices I’ve been using should put all sites in good stead with any legitimate search system. Using only “best practices” with no deceptive or spammy tricks seems to work well across the board.
As Wikia matures, it will be interesting to see how successfully a search engine can index and rank sites using a system similar to Wikipedia.
PC world article >>
Filed under: Uncategorized
I’ll be posting all of my research, notes and tips for web design, development, marketing and organic search optimization here.
Since it seems that I step through this process of educating each client I work with and each friend I help, it only makes sense to start creating a repository for all these FAQs.
Feel free to email me with any questions: dawn@dawnshears.com
Thanks,
Dawn
