Before May 23, 2014, if you had asked me to build a search-engine-crawlable JavaScript website for you, I would have tried to discourage you, vehemently. Ask me now, and I will tell you that it's the way to go.
Just about every JavaScript MVC framework, AngularJS included, modifies the inner content of your HTML structure. This used to make the rendered HTML difficult for search engines to index. With the advancement of technology, however, Google and other search engines understand web pages better. Crawling JavaScript (simple JS, to be precise) is no longer a major issue, and the content of more and more web apps is being indexed by search engines.
That's awesome news for webmasters; however, Google itself advises erring on the side of caution.
So it's still not time to abandon the age-old tricks for making JavaScript-rendered content search engine optimized. There are many techniques webmasters use to add full SEO support to AngularJS and other applications. In my opinion, though, the best method of making JS SEO-friendly is to use special URL routing and a headless browser to automatically retrieve the rendered HTML.
Getting Your AngularJS Apps Indexed
Though Google indexes your content automatically, you can tweak how your content is rendered so that Google's bots index it exactly the way you want. One of the simplest techniques to accomplish this is serving your AngularJS content through a custom backend server.
Modern Search Engines and Client-Side App URLs
To ease the job of indexing web-app content, Google and other search engines offer webmasters the hashbang URL format. Whenever a search engine encounters a hashbang URL, i.e. a URL containing #!, it converts it into a ?_escaped_fragment_= URL, where it expects to find fully rendered HTML content ready to be indexed.
So, for example, Google will turn the hashbang URL http://www.example.com/#!/page/content into the URL http://www.example.com/?_escaped_fragment_=/page/content
At the second URL, which incidentally is never shown to the website's visitors, the search engine will find non-JS content that is easy to index.
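To make the mapping concrete, here is a minimal sketch of the translation a crawler performs on a hashbang URL (the function name is ours, for illustration only):

```javascript
// Convert a hashbang URL into the _escaped_fragment_ form a crawler requests.
// (Real crawlers also percent-encode special characters in the fragment.)
function toEscapedFragment(url) {
  var parts = url.split('#!');
  if (parts.length < 2) return url; // no hashbang, nothing to translate
  return parts[0] + '?_escaped_fragment_=' + parts[1];
}

console.log(toEscapedFragment('http://www.example.com/#!/page/content'));
// http://www.example.com/?_escaped_fragment_=/page/content
```

This is exactly the request your server must be prepared to answer with a pre-rendered snapshot.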
The next step is to make your application intelligent enough that when a search engine bot queries the second URL, your server returns the necessary HTML snapshot of the page. For that, you need to set up the following URL rewriting for your application.
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/$
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=/?(.*)$
RewriteRule ^(.*)$ /snapshots/%1? [NC,L]
Notice that we have set up a special snapshots directory as the rewrite target. This directory will contain the HTML snapshots of your corresponding app pages. You can choose your own directory and adjust the rule accordingly.
The next problem to tackle is instructing AngularJS to use hashbangs. By default, Angular churns out URLs with only # instead of #!. To change that, add the following module as a dependency within your primary Angular module:
angular.module('HashBangURLs', []).config(['$locationProvider', function($locationProvider) {
  // make Angular emit #! URLs instead of plain # URLs
  $locationProvider.hashPrefix('!');
}]);
Using HTML5 Routing Mode Instead of Hashbangs
Did we mention that HTML5 is awesome? Well, it is. Alongside the hashbang technique described above, the HTML5 and AngularJS combination gives us one more way to get search engines to request ?_escaped_fragment_= URLs, without actually using hashbang URLs.
To do that, you first have to tell Google that the page serves AJAX content and that the bot should revisit the same URL using the _escaped_fragment_ syntax. You can do that by including the following meta tag in your HTML code:
<meta name="fragment" content="!">
Then we have to configure AngularJS to use HTML5 URLs wherever it handles URLs and routing. You can do that by adding the following AngularJS module to your code:
angular.module('HTML5ModeURLs', []).config(['$locationProvider', function($locationProvider) {
  // html5Mode is a method of $locationProvider, not $routeProvider
  $locationProvider.html5Mode(true);
}]);
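With HTML5-mode URLs there is no #! to rewrite; a crawler that sees the fragment meta tag simply appends an empty _escaped_fragment_= parameter to the page URL. A quick sketch of that mapping (the function name is ours):

```javascript
// With <meta name="fragment" content="!"> present, a crawler re-requests the
// same URL with an empty _escaped_fragment_ parameter appended.
function toCrawlerUrl(url) {
  var sep = url.indexOf('?') === -1 ? '?' : '&';
  return url + sep + '_escaped_fragment_=';
}

console.log(toCrawlerUrl('http://www.example.com/page/content'));
// http://www.example.com/page/content?_escaped_fragment_=
```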
Handling SEO from the server-side using ExpressJS
In previous posts we talked about the awesomeness of ExpressJS as a server-side JavaScript/Node.js framework. You can also use ExpressJS for the server-side rewriting instead of Apache.
To make ExpressJS deliver the static HTML, we first have to set up a middleware that looks for _escaped_fragment_ in incoming URLs. Once found, it immediately serves the corresponding HTML snapshot.
// In our app.js configuration
app.use(function(req, res, next) {
  var fragment = req.query._escaped_fragment_;

  // If there is no fragment in the query params,
  // then we're not serving a crawler
  if (!fragment) return next();

  // If the fragment is empty, serve the index page
  if (fragment === "" || fragment === "/") fragment = "/index.html";

  // If the fragment does not start with '/', prepend it
  if (fragment.charAt(0) !== "/") fragment = '/' + fragment;

  // If the fragment does not end with '.html', append it
  if (fragment.indexOf('.html') == -1) fragment += ".html";

  // Serve the static HTML snapshot
  try {
    var file = __dirname + "/snapshots" + fragment;
    res.sendfile(file);
  } catch (err) {
    res.send(404);
  }
});
Once again we have set up our snapshots in a top-level directory, here named /snapshots. The ExpressJS middleware also accounts for the possibility that the bot's URL lacks a leading '/' or a '.html' suffix, and still resolves the correct path for the bot.
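The normalization steps in that middleware can be pulled out into a small pure function, which makes the fragment-to-file mapping easy to test in isolation (the helper name is ours, not part of Express):

```javascript
// Map an _escaped_fragment_ value to the snapshot file the middleware serves.
function fragmentToSnapshotPath(fragment) {
  if (fragment === "" || fragment === "/") fragment = "/index.html";
  if (fragment.charAt(0) !== "/") fragment = "/" + fragment;
  if (fragment.indexOf(".html") === -1) fragment += ".html";
  return "/snapshots" + fragment;
}

console.log(fragmentToSnapshotPath('/page1')); // /snapshots/page1.html
console.log(fragmentToSnapshotPath(''));       // /snapshots/index.html
console.log(fragmentToSnapshotPath('page2'));  // /snapshots/page2.html
```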
Taking Snapshots Using Node.js
There are a lot of tools on the market that you can use to take HTML snapshots of your web app, of which Zombie.js and PhantomJS are the most used. These snapshots are what our server would return when Google requests a URL with the _escaped_fragment_ query.
The idea behind PhantomJS, and ZombieJS as well, is to run a headless browser that accesses the regular URL of your web-app page, grabs the rendered HTML content once the page has fully executed, and then writes the final HTML to a file.
There are plenty of resources out there that can guide you through doing this yourself, so we won't go into detail on it here. However, we would certainly like to highlight an open-source tool you can use to take your HTML snapshots: Prerender.io. You can use it as a service, or install it on your own server, as the project is open source and available on GitHub.
What is even easier still is a tool called grunt-html-snapshot, and guess where you can find it: npm, the Node.js package registry.
Grunt is installed through npm, Node's package manager, and you can easily use it to create your own snapshots hassle-free. Here are the steps to set up the Grunt tool and start churning out HTML snapshots:
- First, install Node.js. You can download it from http://nodejs.org; installing Node also gives you npm (the Node package manager). For Mac and Windows users, Node.js comes as a click-and-install application. Ubuntu users can extract the tar.gz file and install it from the command terminal; those on a recent Ubuntu can also install it with the command sudo apt-get install nodejs nodejs-dev npm. Grunt itself is then installed through npm.
- Open your command console and navigate to your project folder.
- To install the Grunt command-line tool globally, run the command: npm install -g grunt-cli
- Then install a local copy of Grunt and its HTML-snapshot task using the command: npm install grunt grunt-html-snapshot --save-dev
- The next step is to create your own Grunt configuration file, Gruntfile.js. The file will contain the following code:
module.exports = function (grunt) {
  grunt.loadNpmTasks('grunt-html-snapshot');

  grunt.initConfig({
    htmlSnapshot: {
      all: {
        options: {
          snapshotPath: '/project/snapshots/',
          sitePath: 'http://example.com/my-website/',
          urls: ['#!/page1', '#!/page2', '#!/page3'],
          sanitize: function (requestUri) {
            // returns 'index.html' if the URL is '/', otherwise a prefix
            if (/\/$/.test(requestUri)) {
              return 'index.html';
            } else {
              return requestUri.replace(/\//g, 'prefix-');
            }
          },
          // if you would rather not keep the script tags in the HTML snapshots,
          // set `removeScripts` to true; it is false by default
          removeScripts: true
        }
      }
    }
  });

  grunt.registerTask('default', ['htmlSnapshot']);
};
- Once you have done that, you can run the task using the command grunt htmlSnapshot
The Grunt tool has some more features that we have skipped here; you can learn more about them on the grunt-html-snapshot page. You will also notice that we give the task the path to the web-app pages, so for it to work properly you need to first set up your website on a server and then point the task at the correct URLs. The snapshots here are stored automatically at the path /project/snapshots/; you can change that as per your requirements.
Sitemaps Are Also Important
For finer control over how search engine bots access your site, you need to fine-tune your sitemap as well. Whenever a search engine bot finds example.com/sitemap.xml, it follows the links given in the sitemap before blindly following all the links on the website. This is the best way to get a page indexed that is not linked from any other page, such as a mailer-campaign landing page, though this practice is frowned upon.
For AJAX content, it's best to list all the pages/URLs that your app generates so that search engines index them properly, even if your app is a single-page app. Here's a sample sitemap:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  ...
  <url>
    <loc>http://www.yourwebsite.com/#!/page1</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.yourwebsite.com/#!/page2</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.yourwebsite.com/#!/page3</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  ...
</urlset>
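If your app's routes are known at build time, you can generate this sitemap instead of writing it by hand. A minimal sketch (the function name, domain, and page list are placeholders for your own):

```javascript
// Build a sitemap.xml string from a base URL and a list of hashbang routes.
function buildSitemap(baseUrl, pages) {
  var entries = pages.map(function (page) {
    return '  <url>\n' +
           '    <loc>' + baseUrl + '#!/' + page + '</loc>\n' +
           '    <changefreq>daily</changefreq>\n' +
           '    <priority>1.0</priority>\n' +
           '  </url>';
  });
  return '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
         entries.join('\n') +
         '\n</urlset>';
}

console.log(buildSitemap('http://www.yourwebsite.com/', ['page1', 'page2', 'page3']));
```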
AngularJS Awesomeness
With the hurdle of non-indexability out of the way, there is no reason why you cannot build your whole web page in JavaScript. People already rely heavily on JS, and the trend is not going to stop. Earlier the major concern was the HTML, but now that AJAX content can be indexed, you can do just about anything. Go fly.