Log Data Deluge: Managing Each and Every Click

by Gary Orenstein, MaxiScale
Aug 21, 2009 12:05:43 PM

The proliferation of online advertising across the Web has had a dual impact. First, it has given companies extensive insight into users’ online behaviors, which creates opportunities to tailor ads to a well-defined target market. Second, it has made it increasingly difficult for IT to capture and manage ever-expanding volumes of log data. With so much valuable data now available, it is up to data center and application managers to track and analyze this information in order to improve ad relevance and deliver best-in-class user experiences.


Log data volumes already exceed what many conventional storage and file system technologies can handle. Organizations that want to track and analyze every single click by every single user need newer architectures to manage this information. A highly optimized distributed file serving infrastructure can help alleviate the woes experienced by social networking, photo sharing and ad serving companies and, in the long run, prevent them from drowning in their own log data.


Challenges with Batch Log Uploads


Typically, logs are captured locally on each Web server. Large Web properties often have hundreds to thousands of Web servers, each of which uploads portions of its log data to a centralized storage repository at a set interval.

 


This process has several drawbacks, most notably:

  • Excessive client coordination to avoid storage overload. Web servers must be manually configured not to upload log files at the same time, to avoid bottlenecks.
  • Custom rsync scripts required for fine-tuning scheduling and operations.
  • Throttling required to avoid I/O contention.
  • Infrequent sync times by necessity, not preference. Having portions of the logs split between Web servers and central storage adds inconvenience and potential for errors.
  • Problems scaling. Traditional NAS servers can easily run out of resources with too many simultaneous client connections.

 

Most of these limitations come from a centralized storage system that cannot process log updates in parallel. Traditional network attached storage systems that are not distributed suffer from these kinds of bottlenecks.
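
To make the coordination burden concrete, here is a minimal sketch of the kind of staggering logic a batch-upload setup might need so that servers do not all hit central storage at once. The function name `upload_offset`, the hostnames and the slot count are hypothetical and chosen only for illustration; the article does not describe any specific scheme.

```python
import hashlib


def upload_offset(hostname: str, interval_s: int, slots: int) -> int:
    """Return how many seconds into each interval this host should start uploading.

    Hypothetical helper: the interval is divided into equal slots and each
    host is mapped to one slot by hashing its name.
    """
    if slots <= 0 or interval_s < slots:
        raise ValueError("need at least one slot and interval_s >= slots")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % slots
    return slot * (interval_s // slots)


if __name__ == "__main__":
    # Example: hourly uploads spread over 60 one-minute slots.
    for host in ("web0001", "web0002", "web0003"):
        print(host, upload_offset(host, interval_s=3600, slots=60))
```

Hashing only spreads the load; two hosts can still land in the same slot, and the central store remains the single point every upload must pass through, which is the limitation described above.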

 

Scalable and Efficient Log Capture

 

Unlike conventional storage and file systems, new distributed systems easily handle the workload demands of hundreds to thousands of Web servers. This delivers the following capabilities for logging:

  • No coordination required: Distributed systems place individual log files across all the nodes, thus allowing extensive load sharing. Some systems can also manage atomic log appends so that client applications need not be involved in locking the log files to make sure appended records preserve their individual integrity.
  • No need for custom rsync scripts: As the system can handle numerous simultaneous connections, customers can avoid customizing rsync scripts.
  • No throttling required: Distributed systems expand horizontally to accommodate higher throughput, eliminating the need to throttle batch uploads.
  • More frequent sync operations now feasible: With enough throughput and connection handling easily available, customers can sync more frequently.
  • Scale seamlessly: Adding a new node takes a few minutes, allowing customers to scale easily and expand the amount of log data they can retain.
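
The load sharing described in the first point can be sketched as a simple mapping from log file name to storage node. This is an illustrative assumption, not the placement method of any particular product: the node names, file names and the hash-modulo rule in `place` are all hypothetical.

```python
import hashlib
from collections import Counter


def place(log_name: str, nodes: list[str]) -> str:
    """Pick the node that stores a given log file (hypothetical hash-modulo rule)."""
    digest = hashlib.sha256(log_name.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
logs = [f"web{i:04d}/access.log" for i in range(1000)]

# Count how many log files each node would receive.
load = Counter(place(name, nodes) for name in logs)

if __name__ == "__main__":
    for node in nodes:
        print(node, load[node])
```

A plain modulo rule like this remaps most files whenever the node count changes, so a system that adds nodes in minutes would need a placement scheme that moves less data; the sketch only shows how writes from many Web servers end up spread across nodes rather than funneled to one.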

 


Logging Directly to Network Storage


Many Web and application developers avoid logging to network storage due to availability concerns. In some cases, lack of storage availability can cause logging processes and applications to hang.


Distributed systems address this concern by replicating data across nodes for availability. Write operations are distributed across nodes, and if a node fails for any reason, it is automatically substituted with a spare resource. The system remains available to accept log file operations throughout this process, ensuring that log operations complete without constraining the application.
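
The replication and spare-substitution behavior described above can be illustrated with a toy in-memory model. Everything here (the `Node` and `ReplicatedLog` classes, the repair-before-write order, copying history from a healthy replica) is an assumption made for illustration; real systems handle failure detection and recovery in their own ways.

```python
class Node:
    """Toy storage node that keeps log records in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.records: list[str] = []


class ReplicatedLog:
    """Toy log that writes every record to all replicas and swaps in spares."""

    def __init__(self, replicas: list[Node], spares: list[Node]) -> None:
        self.replicas = list(replicas)
        self.spares = list(spares)

    def _repair(self) -> None:
        # Replace each failed replica with a spare seeded from a healthy copy.
        for i, node in enumerate(self.replicas):
            if node.alive:
                continue
            healthy = next((n for n in self.replicas if n.alive), None)
            if healthy is None or not self.spares:
                raise RuntimeError("no healthy replica or no spare available")
            spare = self.spares.pop(0)
            spare.records = list(healthy.records)
            self.replicas[i] = spare

    def append(self, record: str) -> None:
        # Repair first, so the caller's write still succeeds after a failure.
        self._repair()
        for node in self.replicas:
            node.records.append(record)


a, b, spare = Node("a"), Node("b"), Node("spare")
log = ReplicatedLog([a, b], [spare])
log.append("click 1")
a.alive = False          # simulate a node failure
log.append("click 2")    # the write still completes, now on the spare and b
```

From the application's point of view, `append` succeeds both before and after the failure, which is the property that keeps a logging process from hanging.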


Another concern for Web developers around logging to network storage is the need to manage locking when multiple clients are writing to the same file. This process can be more trouble than it is worth and often results in a brute-force fallback: saving individual log files locally and then batch uploading them at set intervals.

Systems offering an atomic append mode enable easy updates to log files. This dramatically simplifies the process of maintaining log files shared by hundreds to thousands of servers.
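
As a rough analogy for atomic append, the sketch below uses the POSIX `O_APPEND` flag on a local file, where each `write` call lands at the current end of the file without the writers taking any lock. This is an assumption-laden stand-in: it runs against a local file system, not a distributed one, and `O_APPEND` is not reliably atomic over NFS. The names `append_record` and `demo` are hypothetical.

```python
import os
import tempfile
import threading


def append_record(path: str, record: str) -> None:
    """Append one newline-terminated record using a single O_APPEND write."""
    data = (record.rstrip("\n") + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)
        if written != len(data):
            raise OSError("short write; record may be incomplete")
    finally:
        os.close(fd)


def demo(writers: int = 8, per_writer: int = 200) -> list[str]:
    """Have several threads append to one shared file; return the lines read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.log")

        def work(writer_id: int) -> None:
            for seq in range(per_writer):
                append_record(path, f"writer={writer_id} seq={seq}")

        threads = [threading.Thread(target=work, args=(w,)) for w in range(writers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(path, encoding="utf-8") as fh:
            return fh.read().splitlines()


if __name__ == "__main__":
    lines = demo()
    print(len(lines), "records written without explicit locking")
```

The writers never coordinate with each other; each record goes out in one call and arrives whole. An append mode offered by a distributed system would provide the same guarantee across machines, through whatever interface that system exposes.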

 


Conclusion


Log file capture is difficult to manage with conventional storage and file systems that do not scale and cannot handle the simultaneous load that is part of batch log updates.


Distributed systems can handle simultaneous batch uploads with ease. Coupled with atomic append mode features, these systems make logging directly to network storage practical and efficient, and they do not require distributed locks to coordinate clients.


Together, these capabilities help Web companies take control of log data, allowing them to effectively capture and analyze vast amounts of information to successfully run their business.
