Alfresco's multi-tier structure allows a great deal of flexibility in system design and implementation architecture.

With a choice of operating systems, databases and content locations, the number of options can become overwhelming.

Rather than trying to narrow the options, sometimes it is useful to cast the net wide and look at far-reaching possibilities in order to confirm your architectural decision. Choosing the right storage solution for your implementation is vital in ensuring the longevity of your investment and mitigating the impact of future expansion. As Alfresco becomes critical to your business needs, not only will downtime be unacceptable, but increased use will see storage needs escalate beyond what you may have initially planned.

In this article we will explore some of the options for content storage for Alfresco, in particular those that may offer some advantage in the way they manage, optimise or segment data. This article will not pretend to address the (inter)connectivity or shelf technology used within your infrastructure. Whether you use FC, iSCSI, AoE or local disks is certainly an area that should be explored, and each has its merits, but adding this to the mix of storage platforms is, perhaps, for another article.

What does Alfresco need?

It is advisable to optimise each layer of Alfresco to provide a highly performant system. Depending on the scale of your implementation, this will usually involve a clustered Alfresco 'front-end' with localised search indices, clustered databases and SAN storage environments. As with any technology implementation, there needs to be an educated balance between performance, reliability, availability and accessibility, and this is no less important when addressing your Alfresco implementation.

For storage, Alfresco likes to see a local mount point or volume. For *nix people, this is likely an NFS mount point or, for Windows, a locally addressable SAN/iSCSI volume. This, therefore, gives you a huge number of possibilities for the storage subsystem. Whatever your solution, it is imperative that you put in place a scalable storage solution that will allow, in particular, your content store to grow with your demand. With the collaborative tools, ease of access and versioning features, the amount of data held within Alfresco can rapidly outstrip your existing storage growth expectations.
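Because Alfresco simply sees a mount point, keeping an eye on growth can start with something as small as the following Python sketch. It is only an illustration: the function name, the 80% warning threshold and the path you pass in are all assumptions, not anything Alfresco provides.

```python
import shutil


def capacity_report(path, warn_fraction=0.8):
    """Return usage figures for the volume that holds `path`.

    `warn_fraction` is an arbitrary example threshold, not an Alfresco setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "used_fraction": used_fraction,
        "warn": used_fraction >= warn_fraction,
    }


if __name__ == "__main__":
    # "." stands in for your content store mount point.
    print(capacity_report("."))
```

Run on a schedule and recorded over time, figures like these give you a growth trend to plan against rather than a single snapshot.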

With growth in storage demands, it is paramount to revisit your backup strategy, not just for the content within Alfresco but for the database, indices and configuration. Whilst your storage platform may, in and of itself, maintain mirror or snapshot copies of data, you must have absolute confidence, and proof, that you have the ability to restore each tier of your solution. An increase in backup size and complexity may require an extension to your backup window and increase the cost of backup media. (Alfresco backup and restore).
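One modest way to gather proof for the content tier is to compare checksums of the original files against a test restore. The sketch below assumes both trees are reachable as ordinary directories; the function names are hypothetical, and it says nothing about whether the database and indices were restored to a matching point in time, which needs checking separately.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to `root`) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB blocks so large files are not held in memory.
                for block in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(block)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_manifests(original, restored):
    """Return (missing, changed) lists of paths, judged against `original`."""
    missing = sorted(set(original) - set(restored))
    changed = sorted(
        p for p in original if p in restored and original[p] != restored[p]
    )
    return missing, changed
```

An empty result from compare_manifests for a test restore is evidence, not a guarantee; it only covers the files that existed when the original manifest was built.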

For performance and cost reasons you may also want to consider hierarchical storage management (where you assign data to particular speed/performance or cost bracketed storage) or secondary content stores within Alfresco itself.
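To get a feel for how much data might suit a cheaper tier, you could start by listing files that have not been modified for some time. The Python sketch below does only that; the function name and the age threshold are assumptions, and it deliberately moves nothing, since any relocation needs to go through a mechanism that your storage platform or Alfresco configuration actually supports.

```python
import time
from pathlib import Path


def cold_candidates(root, min_age_days, now=None):
    """List (path, size_in_bytes) for files not modified in `min_age_days`.

    This only reports candidates; it never moves or deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds in a day
    found = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            info = path.stat()
            if info.st_mtime < cutoff:
                found.append((path, info.st_size))
    return sorted(found)


def total_size(candidates):
    """Sum the sizes of the candidate files, in bytes."""
    return sum(size for _, size in candidates)
```

Comparing total_size for a few different thresholds gives a rough idea of whether a second tier would hold enough data to be worth the added complexity.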

Recently, two file systems have been of particular interest to me for use with Alfresco. Each is massively scalable and tunable but, of course, each has its merits and its pitfalls.

ZFS

ZFS is a highly scalable and configurable file system and logical volume manager. With the ease of storage pool expansion and the ability to tailor performance based on usage, ZFS has many features that can complement an Alfresco build. By tuning the ZFS intent log (ZIL), potentially placing it on a separate SSD* log device, you can offset load on the main pool disks and increase synchronous write performance. Read performance is helped not by the ZIL but by caching, for example by adding an SSD as a second-level cache (L2ARC) device.

(You may want to use Richard Elling's Zilstat to help you maximise your ZIL performance.) [*If you are using SSDs, check that they can cope gracefully with power failure.]

GlusterFS

GlusterFS is an open source scale-out NAS solution for unstructured data. With scalability to multiple petabytes, there is little chance that you will struggle for space. Although potentially complicated by the use of FUSE, the client/server architecture allows performance to increase as storage grows. With dynamic volume addition, deletion and migration, GlusterFS achieves a high level of resilience to failure whilst easing management overhead. If you are providing Alfresco with high volumes of data, GlusterFS may represent a cost-effective way of growing your capacity using commodity hardware. It also runs well in cloud environments and has many features tailored to synchronisation of data.

(Note: be sure to use the most recent version of GlusterFS, as earlier versions had file-locking issues that caused problems with Alfresco.)

Data De-duplication

Whilst it is possible to implement checks for duplicate file upload and storage within Alfresco itself, it is also possible to utilise a de-duplication layer within the storage system. Whether this is in-line or post-processing will be determined by the technology that is used, and performance will vary depending not only on the chosen method but also on the chunking that is used.
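To see why chunking matters, the following Python sketch estimates a de-duplication ratio using fixed-size chunks and SHA-256 digests. It is a toy model under stated assumptions (in-memory data, a hypothetical dedup_ratio function, fixed rather than variable chunk boundaries) and does not represent how any particular storage product implements de-duplication.

```python
import hashlib


def dedup_ratio(blobs, chunk_size=4096):
    """Return logical bytes divided by bytes stored once duplicates are dropped.

    Each blob is cut into fixed-size chunks, and a chunk whose SHA-256
    digest has already been seen is counted as stored only once.
    """
    unique_chunks = {}  # digest -> chunk length in bytes
    logical_bytes = 0
    for data in blobs:
        logical_bytes += len(data)
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            unique_chunks.setdefault(digest, len(chunk))
    stored_bytes = sum(unique_chunks.values())
    return logical_bytes / stored_bytes if stored_bytes else 1.0


if __name__ == "__main__":
    first = b"A" * 8192
    second = b"A" * 4096 + b"B" * 4096
    # Four logical chunks, of which only two are distinct.
    print(dedup_ratio([first, second], chunk_size=4096))
```

With fixed boundaries, inserting a single byte at the start of a file shifts every later chunk, so previously shared chunks may no longer match; this is one reason the choice of chunking method influences the results.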

So...

ZFS and GlusterFS represent just two of the many potential file storage platforms for Alfresco data. The solid history of ZFS and the open-source nature of GlusterFS present two varied yet like-minded approaches to scalable storage. With resilience and performance at their core and by using commodity hardware, rather than black boxes, there is a real opportunity to efficiently grow storage with demand without adversely escalating cost or vendor lock-in.

It is difficult to profess that any single piece of an Alfresco implementation is more important than another; what is certain is that where Alfresco data is held is of huge significance. By giving attention to storage, you lay a firm foundation for the rest of the application to build upon.

Further reading: