When I met a scientist from Last.fm at RecSys '08, the first thing he asked about our iletken project was: "How did you solve the scalability issues?" So I have decided to talk about scalability and caching, based on the experience I gained during my internship at Turkcell Teknoloji. I will also try to explain a specific product, Oracle Coherence, which is a clustered cache solution (or data grid). In the future, I will add some test results to this post. The main reason I am covering Oracle Coherence is that I had the chance to work with it, and Oracle has very good examples/illustrations of data topologies that explain the different approaches to distributed caches. There are many alternatives to Coherence, but I do not have enough experience with them.
All the images below belong to Oracle.
If you are designing a customer-oriented web application, you should plan for the day when your user base grows exponentially. The mythical question is: will your application scale easily? That is, will the load capacity/performance of the application increase as you add more servers? The answer is not simple, and don't believe your IT department if they casually say, "Well... yeah, sure. Why not? As we get more users, we will get faster or more machines." Scalability is actually closely tied to your initial design choices, and it is hard to fix a non-scalable system. So scalability is not a hardware issue but a software design issue, and a software architect must take it into consideration.
The nature of next-generation web services requires more data processing than ever. Keeping track of user interactions with the system and analyzing those behaviors means, at the most basic level, I/O operations. And if you are planning to hit the database at each interaction, you are probably doomed. A better solution is to use caches in the application server, both to avoid an I/O operation at every interaction and to relieve the bottleneck that can form between your database and application tiers. But how do caches in the application servers scale? The answer is clustered systems and the topologies used in those systems.
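The idea of asking the cache first and only touching the database on a miss is often called cache-aside. Here is a minimal, self-contained Java sketch of it; the class and field names are my own illustration (the "database" is just a function standing in for a real DB call), not any product's API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A minimal cache-aside sketch: the application asks the cache first and
// only reaches the (simulated) database on a miss. Illustrative names only.
public class CacheAside {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> database; // stands in for a real DB call
    public int dbHits = 0; // counts how often we actually reach the DB

    public CacheAside(Function<String, String> database) {
        this.database = database;
    }

    public String get(String key) {
        // computeIfAbsent loads from the "database" only when the key is missing
        return cache.computeIfAbsent(key, k -> {
            dbHits++;
            return database.apply(k);
        });
    }
}
```

With this in place, repeated reads of the same key cost one DB round trip instead of one per interaction, which is exactly the I/O savings described above.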
I wish to talk about some basics first.
When you need to improve your application's performance with respect to database operations, you have two options:
- Solving the problem in the Database Tier
- Using caching solutions (preferably clustered) in the Middle (Application) Tier (Application servers)
For a long time, trying to solve the problem in the database tier was the mainstream approach. If you go that way, there are again two options to improve your DB:
- a) Buy faster machines (200-CPU, expensive ones)
- b) Combine other machines with your existing DB (clustering).
Buying a faster machine is easy, but it is extremely expensive and does not always solve your problem. It is also a single point of failure, so you'll have to support it with other systems.
On the other hand, a clustered DB is hard to maintain and costs more to scale than scaling in the application tier. There are solutions like RAC, but they still have single points of failure, are hard to maintain, and hardly scale linearly. Not to mention the respectable license fees.
I believe the mainstream is now moving toward middle-tier computing, and I think that is the right choice for many web companies. I prefer the second option, clustered caching solutions in the application tier, because:
- Application servers are cheap (a good one is around 2,000 USD)
- You don't have to write every interaction to the DB as soon as it happens. You can keep the changes in the cache and write them back later. (This might depend on your business needs.)
- They are said to be easy to maintain. [I had some issues setting up the cluster nodes.]
- They scale effectively.
- If you use a well-designed caching solution, there is no single point of failure, and all the nodes in the cluster look the same to the programmer: even with 2,000 servers, there is only one logical cache to put objects into or get them from.
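The write-back idea mentioned in the list can be sketched in a few lines of plain Java. This is a toy, synchronous model with hypothetical names (real products do this asynchronously, with batching and failure handling); the "database" here is just an in-memory map so the example is self-contained:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A minimal write-back (write-behind) sketch: writes land in the cache and
// are marked dirty; the database is only touched when flush() runs later.
public class WriteBackCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();
    public final Map<String, String> database = new HashMap<>(); // simulated DB table

    public void put(String key, String value) {
        cache.put(key, value);   // fast in-memory write
        dirty.add(key);          // remember it still needs to reach the DB
    }

    public String get(String key) {
        return cache.get(key);
    }

    // Called later (e.g. on a timer), writing only the changed entries back
    public void flush() {
        for (String key : dirty) {
            database.put(key, cache.get(key));
        }
        dirty.clear();
    }
}
```

Note that two updates to the same key before a flush collapse into a single DB write, which is where much of the I/O saving comes from.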
I had a wonderful internship experience at Turkcell, where I was on the infrastructure team. One of my assignments was to investigate and test Oracle Coherence. I learned the basics of the system and the topologies it uses to keep the data, and wrote some basic Java web applications that use the cache in order to test it.
Coherence is a Data Grid solution (and more) for clustered systems.
What is Oracle Coherence?
This is a quote from Oracle’s Coherence product page:
“Coherence provides replicated and distributed (partitioned) data management and caching services on top of a reliable, highly scalable peer-to-peer clustering protocol. Coherence has no single points of failure; it automatically and transparently fails over and redistributes its clustered data management services when a server becomes inoperative or is disconnected from the network. When a new server is added, or when a failed server is restarted, it automatically joins the cluster and Coherence fails back services to it, transparently redistributing the cluster load. Coherence includes network-level fault tolerance features and transparent soft re-start capability to enable servers to self-heal.”
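To make the "distributed (partitioned)" part of that quote concrete, here is a conceptual toy in plain Java: each key is owned by exactly one node, chosen by hashing, so total cache capacity grows as you add nodes, while the caller still sees a single cache. This is my own simplified illustration, not Coherence's actual partitioning scheme or API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A conceptual sketch of a partitioned cache: keys are routed to an
// owning node by hash, but the caller only sees one put/get interface.
public class PartitionedCache {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public PartitionedCache(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>()); // each map stands in for one server's memory
        }
    }

    private Map<String, String> ownerOf(String key) {
        // Math.floorMod keeps the index non-negative for any hash code
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    // The routing to the owning node is hidden from the programmer
    public void put(String key, String value) { ownerOf(key).put(key, value); }
    public String get(String key)             { return ownerOf(key).get(key); }
}
```

The failover and redistribution behavior the quote describes is the hard part that a product like Coherence adds on top of this basic routing idea (e.g. backup copies on other nodes, so losing one server loses no data).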
When / why do I need a Data Grid solution like Coherence?
To be continued…
[I will update the post as I find some time to write more. I'm sorry, but that's all for now.]