The proliferation of online advertising across the Web has had a dual impact. On the one hand, it has given companies extensive insight into users’ online behaviors, which creates a slew of opportunities to tailor ads to a well-defined target market. On the other hand, it has created growing challenges for IT in capturing and managing ever-expanding loads of log data. With this wealth of valuable data now available, it is up to data center and application managers to track and analyze it adequately in order to improve ad relevance and deliver best-in-class user experiences.
Log data volumes already exceed what many conventional storage and file system technologies can handle. Organizations need newer architectures, ones that support the tracking and analysis of every single click by every single user, to manage this information better. A highly optimized distributed file serving infrastructure can help alleviate the woes experienced by social networking, photo sharing and ad serving companies and, in the long run, prevent them from drowning in their own log data.
Challenges with Batch Log Uploads
Typically, logs are captured locally on each Web server. Large Web properties often have hundreds to thousands of Web servers, each of which uploads portions of its log data to a centralized storage repository at a set interval.

This process has several drawbacks, most notably:
- Excessive client coordination to avoid storage overload. Web servers must be manually configured not to upload log files at the same time, to avoid bottlenecks.
- Custom rsync scripts are required to fine-tune scheduling and operations.
- Systems must be throttled to avoid I/O contention.
- Infrequent sync times by necessity, not preference. Having portions of logs spread between Web servers and storage adds inconvenience and potential for errors.
- Problems scaling. Traditional NAS servers can easily run out of resources with too many simultaneous client connections.
Most of these limitations come from a centralized storage system that cannot process log updates in parallel. Traditional network attached storage systems that are not distributed suffer from these kinds of bottlenecks.
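To make the coordination burden concrete, here is a minimal sketch of the kind of scheduling logic a custom upload script might carry so that servers do not all upload at once. The function names, the hostname-hash approach and the interval value are illustrative assumptions, not something taken from any particular deployment.

```python
import hashlib


def upload_offset_seconds(hostname: str, interval_seconds: int) -> int:
    """Return a stable offset inside the upload interval for this host.

    Hashing the hostname spreads servers across the interval without a
    hand-maintained schedule (an assumed approach, for illustration only).
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


def next_upload_time(hostname: str, now: int, interval_seconds: int) -> int:
    """Return the next epoch second at which this host should upload."""
    offset = upload_offset_seconds(hostname, interval_seconds)
    window_start = now - (now % interval_seconds)
    candidate = window_start + offset
    if candidate <= now:
        # This host's slot in the current window has passed; use the next one.
        candidate += interval_seconds
    return candidate
```

Even this small amount of logic has to be deployed, monitored and retuned as servers are added, which is the overhead the list above describes.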
Scalable and Efficient Log Capture
Unlike conventional storage and file systems, new distributed systems easily handle the workload demands of hundreds to thousands of Web servers. This delivers the following capabilities for logging:
- No coordination required: Distributed systems place individual log files across all the nodes, thus allowing extensive load sharing. Some systems can also manage atomic log appends so that client applications need not be involved in locking the log files to make sure appended records preserve their individual integrity.
- No need for custom rsync scripts: As the system can handle numerous simultaneous connections, customers can avoid customizing rsync scripts.
- No throttling required: Distributed systems expand horizontally to accommodate higher throughput, eliminating the need to throttle batch uploads.
- More frequent sync operations now feasible: With enough throughput and connection handling easily available, customers can sync more frequently.
- Scale seamlessly: Adding a new node takes a few minutes, allowing customers to scale easily and expand the amount of log data they can retain.
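
As a rough illustration of how log files might be spread across nodes, the sketch below uses rendezvous (highest-random-weight) hashing to pick a node for each log file. This is one well-known placement technique, chosen here as an assumption; the article does not say how any specific distributed system places files, and the node names and paths are hypothetical.

```python
import hashlib
from typing import Sequence


def _score(node: str, log_path: str) -> int:
    """Pseudo-random weight for a (node, file) pair."""
    digest = hashlib.sha256(f"{node}|{log_path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(log_path: str, nodes: Sequence[str]) -> str:
    """Choose the node with the highest weight for this log file."""
    if not nodes:
        raise ValueError("at least one node is required")
    return max(nodes, key=lambda node: _score(node, log_path))
```

With this scheme, adding a node only moves the files whose highest weight now belongs to the new node; every other file stays where it was, which fits the idea of expanding horizontally without reshuffling everything.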

Logging Directly to Network Storage
Many Web and application developers avoid logging to network storage due to availability concerns. In some cases, lack of storage availability can cause logging processes and applications to hang.
Distributed systems address this concern by replicating data across nodes for availability. Write operations are distributed across nodes, and if a node fails for any reason, it is automatically substituted with a spare resource. The system remains available to accept log file operations throughout this process, ensuring that log operations complete without constraining the application.
Another concern for Web developers around logging to network storage is the need to manage locking when multiple clients are writing to the same file. This process can be more trouble than it is worth and often results in a brute force fallback option of saving individual log files and then batch uploading at set intervals.
Systems offering an atomic append mode enable easy updates to log files. This dramatically simplifies the process of creating atomic log files shared by hundreds to thousands of servers.
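
From the client's point of view, append-mode logging can look like the minimal sketch below, which opens the log with the POSIX O_APPEND flag and writes each record in a single call, with no lock taken by the application. This is an analogy under stated assumptions: the helper name is hypothetical, and whether concurrent appends from many clients stay intact depends on the storage system (NFS, for example, does not guarantee it). The interface of any particular system's atomic append mode is not described here.

```python
import os


def append_record(path: str, record: str) -> None:
    """Append one newline-terminated record without client-side locking."""
    data = (record.rstrip("\n") + "\n").encode("utf-8")
    # O_APPEND asks the file system to position every write at end of file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)
        if written != len(data):
            # A short write would split the record; surface it to the caller.
            raise OSError("short write; record may be incomplete")
    finally:
        os.close(fd)
```

Keeping each record in one write call is the important design choice: the storage layer, not the application, is then responsible for keeping records from interleaving.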

Conclusion
Log file capture is difficult to manage with conventional storage and file systems that do not scale and cannot handle the simultaneous load that is part of batch log updates.
Distributed systems can handle simultaneous batch uploads with ease. Coupled with atomic append mode features, these systems make logging directly to network storage practical and efficient, and they do not require distributed locks to coordinate clients.
Together, these capabilities help Web companies take control of log data, allowing them to effectively capture and analyze vast amounts of information to successfully run their business.