How to integrate Alfresco with Amazon S3

The years 2014 and 2015 were the years of the cloud. We saw a huge rise in companies, enterprises and startups alike, moving their application data and processing power from on-premise infrastructure to cloud-based, on-demand setups from Amazon, Google, and Microsoft Azure.

Alfresco, afraid of being left behind in the cloud race, launched its own cloud-based products, Alfresco One and Alfresco in the Cloud. However, these products come with a catch: in addition to being paid products, they are not as customizable as Alfresco Community Edition. To overcome that, Alfresco and developers came up with solutions that let users integrate Alfresco with cloud-based file storage services. In this post we focus on the hows and whys of integrating Alfresco with Amazon Simple Storage Service, or Amazon S3.

Some Disclaimers before we continue

Before we continue, here’s a warning. If you go to the official Alfresco website and read the documentation pages on the Alfresco Amazon integration, you will see this disclaimer:

“The Alfresco S3 Connector module can be applied to Alfresco Enterprise 4.1.1 or later. It requires an Alfresco instance running on Amazon’s Elastic Compute Cloud (EC2), connected to Amazon’s Simple Storage Service (S3). Other devices or services that advertise as being S3 compatible have not been tested and are therefore not supported.”

This gives the impression that you cannot integrate Alfresco with S3 unless Alfresco is installed on an EC2 instance. That is not the case. We have integrated Alfresco with S3 even when the Alfresco instance was not running on EC2, although it required some custom coding. Everything we discuss here we have tried with the latest versions of both Alfresco Community and Alfresco Enterprise, on both Amazon EC2 and private servers.

Why integrate Alfresco and S3?

First of all, we have to understand that any integration or solution comes with both positives and negatives. The best way to achieve a goal ultimately has to be decided case by case. So to understand why you might need both Alfresco and S3, let’s take a use case.

Suppose you are a startup with a consumer-centric web application that lets users store high-quality images and documents, and you use Alfresco as the backend to manage those documents. You run an average server with a fixed configuration and fixed storage. You are new in your business, so your user base and the data stored on the app servers can fluctuate widely. If you suddenly see an increase in users and inbound data, you may have to buy a new configuration from your server provider. In most cases that means investing in an expensive package that also includes a faster processor and more RAM, neither of which you may need.

And even that investment is risky. For example, if you suddenly see a drop in customers and a large amount of data being deleted, it may not be economical to maintain that expensive package. But downgrading on traditional servers usually requires downloading all the data, buying a new package (again on a monthly payment basis), and re-uploading the data to the new package, which means some serious downtime.

The solution is to move your storage to a cloud-based service that lets you scale up or down easily. There are two ways to do that: either buy an Alfresco Cloud or Alfresco One license, or invest in Amazon S3 or a similar service and integrate it with your own Alfresco installation. The second option gives you the document management capabilities of Alfresco along with the scalability and flexibility of the cloud and the security of Amazon’s servers. As we said earlier, the Alfresco cloud platforms are another paid investment on top of your present ones and are quite rigid in their application, whereas a combination of S3 and Alfresco gives you more flexibility.

So to recap, there are three main reasons to go for an Alfresco and Amazon S3 combo:

1. Scalability, Security, Speed, and Reliability of Amazon Storage Solution
2. Flexibility, customizability, and document management capabilities of Alfresco Enterprise or Community
3. Availability of a dependable Cloud Solution.

How to Integrate Amazon S3 and Alfresco?

For Alfresco Enterprise Edition

If you are using Alfresco Enterprise Edition, the integration is straightforward:

1. Download the alfresco-s3-connector-1.1.0.2-7.zip file from the Alfresco Support Portal and extract the alfresco-s3-connector-1.1.0.2-7.amp file from it.
2. Install the AMP file in the Alfresco repository WAR using the Module Management Tool (MMT).
3. Restart the server.
4. Configure the connector by editing the /alfresco-global.properties file. You can find more about configuring the file from this link.

For more information on installing and configuring the S3 Connector, check out this documentation by Alfresco.

For Alfresco Community Edition

This is where things get tricky. Unlike the Enterprise edition, the Community edition has no plug-and-play module. You need to leverage the Alfresco Content Store to allow documents to be stored in Amazon’s S3. Once again, it’s advisable to host your Alfresco instance on Amazon EC2 servers, but you can use your traditional or on-premise servers as well.

Most of how we integrate Amazon S3 and Alfresco Community is proprietary code, so I will try to explain what we do and how we do it without actually sharing the code.

The Alfresco Content Store is the mechanism that controls where and how document binary files are stored by Alfresco. It is highly customizable and is heavily used by developers who like to micro-manage their storage. Among the ContentStore implementations is the CachingContentStore class, which can be used to speed up content retrieval. We shall be using this class.

Alfresco’s native ContentService uses fileContentStore to perform content read and write operations. For our solution we created a custom class and bean named CustomCachingContentStore that overrides the fileContentStore bean.

The CustomCachingContentStore class extends the CachingContentStore class, so we get the caching behavior automatically.

We also overrode the getWriter() method so that whenever a new document is uploaded, it is first written to the caching store, and a separate thread is then started to write that document to the backing store.
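
The write path described above can be illustrated with a minimal, self-contained sketch. It is not the proprietary implementation and does not use any Alfresco classes: SimpleStore, InMemoryStore, and WriteBehindStore are hypothetical names invented for this example, and an in-memory map stands in for both the local cache and S3. The only point is the ordering: the cache write happens on the caller's thread, and the backing-store write is handed to a worker thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a content store; this is not an Alfresco interface. */
interface SimpleStore {
    void put(String contentUrl, byte[] data);
    byte[] get(String contentUrl);
}

/** In-memory store, used here in place of both the local cache and S3. */
class InMemoryStore implements SimpleStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    public void put(String contentUrl, byte[] data) {
        blobs.put(contentUrl, data.clone());
    }

    public byte[] get(String contentUrl) {
        return blobs.get(contentUrl);
    }
}

/** Writes to the cache immediately, then copies to the backing store on a worker thread. */
class WriteBehindStore implements SimpleStore {
    private final SimpleStore cache;
    private final SimpleStore backingStore;
    private final ExecutorService uploader = Executors.newSingleThreadExecutor();

    WriteBehindStore(SimpleStore cache, SimpleStore backingStore) {
        this.cache = cache;
        this.backingStore = backingStore;
    }

    public void put(String contentUrl, byte[] data) {
        cache.put(contentUrl, data);                                // fast local write
        uploader.execute(() -> backingStore.put(contentUrl, data)); // slow write, off the caller's thread
    }

    public byte[] get(String contentUrl) {
        byte[] cached = cache.get(contentUrl);
        if (cached != null) {
            return cached;                                          // cache hit
        }
        byte[] fetched = backingStore.get(contentUrl);              // cache miss: go to the backing store
        if (fetched != null) {
            cache.put(contentUrl, fetched);
        }
        return fetched;
    }

    /** Stops accepting uploads and waits for queued ones; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long seconds) {
        uploader.shutdown();
        try {
            return uploader.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

One consequence of this design is that a document can be readable from the cache before it exists in the backing store, so a real implementation would need to decide what happens if the background upload fails.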

This backing store functionality is provided by a custom bean that we named s3ContentStore. Its class extends the AbstractContentStore class and overrides the following methods: getReader(), getWriterInternal(), delete(), and isWriteSupported().

getReader():
This method returns an instance of the S3ContentReader class, which extends the AbstractContentReader class. The S3ContentReader class provides a readable byte channel through which we can fetch a document based on its content URL.

getWriterInternal():
This method returns an instance of the S3ContentWriter class, which extends the AbstractContentWriter class. The S3ContentWriter class provides a writable byte channel and adds a ContentStreamListener to write the uploaded file to Amazon S3. The S3StreamListener class implements the ContentStreamListener interface and overrides the contentStreamClosed() method, where the files are written to Amazon S3.

delete():
This method deletes a document based on its content URL.
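
The "upload when the stream closes" idea behind the writer can be sketched in plain Java. This is an illustration under stated assumptions, not the actual S3ContentWriter or S3StreamListener code: StreamClosedListener, NotifyingOutputStream, and FakeBucket are hypothetical names, a java.io.OutputStream is used instead of an NIO byte channel to keep the example short, and a map stands in for the S3 bucket where a real version would call the AWS SDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical callback, playing the role the text gives to a stream listener. */
interface StreamClosedListener {
    void streamClosed(byte[] content);
}

/** Buffers writes in memory and notifies the listener exactly once, when closed. */
class NotifyingOutputStream extends OutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final StreamClosedListener listener;
    private boolean closed;

    NotifyingOutputStream(StreamClosedListener listener) {
        this.listener = listener;
    }

    @Override
    public void write(int b) throws IOException {
        if (closed) {
            throw new IOException("stream already closed");
        }
        buffer.write(b);
    }

    @Override
    public void close() {
        if (closed) {
            return;                                   // a second close() must not upload again
        }
        closed = true;
        listener.streamClosed(buffer.toByteArray());  // hand the complete content to the listener
    }
}

/** Stand-in for a bucket: maps object keys to bytes. Assumption: a real version would call the AWS SDK. */
class FakeBucket {
    final Map<String, byte[]> objects = new HashMap<>();

    /** Returns a stream whose content is stored under the given key once the stream is closed. */
    OutputStream writerFor(String key) {
        return new NotifyingOutputStream(content -> objects.put(key, content));
    }
}
```

Nothing reaches the bucket until close() is called, which mirrors the description above: the caller writes through an ordinary stream, and the upload is a side effect of closing it.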

<bean id="fileContentStore"
      class="com.example.alfresco.repo.content.caching.CustomCachingContentStore"
      init-method="init">
    <property name="backingStore" ref="s3ContentStore" />
    ...
</bean>

<bean id="s3ContentStore"
      class="com.example.alfresco.repo.content.s3.S3ContentStore">
    ...
</bean>

We have also overridden the deletedContentStore bean to handle deleted content. Deleted content, i.e. documents that are permanently deleted from the user’s trashcan, gets pushed into this store, where it can be cleaned up at will.

<bean id="deletedContentStore"
      class="com.example.alfresco.repo.content.s3.S3DeletedContentStore">
    ...
</bean>
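
The two-stage deletion described above (move first, clean up later) can be modelled in a few lines of Java. This is only a sketch of the concept: TwoStageDeletion and its method names are hypothetical, and two in-memory maps stand in for the live store and the deleted content store. It does not show how S3DeletedContentStore is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a live store plus a "deleted" holding area, both keyed by content URL. */
class TwoStageDeletion {
    final Map<String, byte[]> liveStore = new TreeMap<>();
    final Map<String, byte[]> deletedStore = new TreeMap<>();

    /** Moves content out of the live store into the holding area; returns false if the URL is unknown. */
    boolean moveToDeleted(String contentUrl) {
        byte[] data = liveStore.remove(contentUrl);
        if (data == null) {
            return false;
        }
        deletedStore.put(contentUrl, data);
        return true;
    }

    /** Permanently removes everything in the holding area and returns the purged URLs. */
    List<String> purgeDeleted() {
        List<String> purged = new ArrayList<>(deletedStore.keySet());
        deletedStore.clear();
        return purged;
    }
}
```

Keeping deleted binaries in a separate area until an explicit purge gives an administrator a window to recover content before it is gone for good.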

In our use case, the S3DeletedContentStore class extends the S3ContentStore class and overrides the isContentUrlSupported() and exists() methods to suit our needs.

Now before you complain that all this is a little too vague, let me assure you that yes, it was meant to be. The actual code is proprietary and still under intensive testing. We may release the code in a future post. For the time being, if you are wondering whether Alfresco Community and S3 can be connected, then yes, it’s possible. If you are considering this for your project, you can contact us for expert advice.

Amazon S3 is also an investment

Before you move your setup to Amazon S3, it’s imperative to understand how you would be using it and whether your application really needs a scalable storage solution at this point. If you are confident that you won’t have to expand your application’s storage capabilities in the foreseeable future, then stick to what you know and what you have. But it’s always wise to be prepared for the future. Amazon S3 is a paid service, and before investing in it, just as with any decision related to money, do analyze the pros and cons. Consulting experts may help.

References:
Alfresco Docs, Alfresco Wiki

Pratyush Kumar

Co-Founder & President at Algoworks, Open-Source | Salesforce | ECM

Pratyush is Co-Founder and President at Algoworks. He is responsible for managing and growing the open source technologies team and has spearheaded more than 200 projects in Salesforce CRM alone. He provides consulting and advisory services to clients looking for services relating to CRM (Customer Relationship Management) and ECM (Enterprise Content Management). In the past, Pratyush has held consulting roles with various global technology leaders, such as Globallogic & HCL in India. He holds an Engineering graduate degree from Indian Institute of Technology, Roorkee.