I used to write log parsing scripts all the time with Python. That's basically how I got started programming. In the past few years, I've been working almost entirely on application development. Recently my boss wanted the number of unique active users in a day and the number of unique active users within the past 30 days, plus an actual list of those users with the number of requests each one made to the system.
There were a few issues. I had to keep in mind that there were sometimes anonymous requests, which I didn't want to add to my list since they didn't serve my purpose. I also noticed that some lines didn't seem to contain a request at all, so I had to account for that too. Let's all hop on the regex party train!
Thanks to the magic of regex, I can very quickly tell whether a line is a request from an anonymous user and skip it. I can also skip a line if there is no IP address at the beginning, which usually happens when a single request spans multiple lines. I also have a regex to match the date format that our logs use. If there's a date, I grab it; if not, I skip that line too.
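To make the three checks concrete, here's a rough sketch in Python. The post doesn't show the real log format, so everything here is an assumption: I'm pretending the logs look like Apache-style access logs (IP, identd, user, then a bracketed date) and that an anonymous user shows up as `-`. The pattern names and `parse_line` are made up for illustration.

```python
import re

# Hypothetical patterns for an Apache-style access log; adjust to the real format.
IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}\s")    # line must start with an IP address
USER_RE = re.compile(r"^\S+ \S+ (\S+) ")             # assumed layout: IP, identd, user
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4})") # e.g. [12/Mar/2012:10:00:00 +0000]


def parse_line(line):
    """Return (date, user) for a valid request line, or None if it should be skipped."""
    # No IP at the start: probably a continuation of a multi-line request.
    if not IP_RE.match(line):
        return None
    # Assumption: anonymous requests carry "-" in the user field.
    user_match = USER_RE.match(line)
    if user_match is None or user_match.group(1) == "-":
        return None
    # No recognizable date: skip.
    date_match = DATE_RE.search(line)
    if date_match is None:
        return None
    return date_match.group(1), user_match.group(1)
```

Anything that comes back as `None` gets skipped; everything else is a (date, user) pair ready to be counted.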
Now I'm left with only valid lines. SWEET! But what do I do with them? Since I need to keep track of both the users for each day (along with their number of requests) and the number of unique users for the day and for the past 30 days, a hash will definitely be my best friend. All of the information in this case will go into a hash called all_days. The basic format: each date is a key, and its value is another hash that maps each user to that user's request count for that day.
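As a rough picture of that shape, written as a Python dict (the dates, usernames, and counts below are invented purely for illustration):

```python
# Illustrative shape only; the dates, usernames, and counts are made up.
all_days = {
    "12/Mar/2012": {"alice": 3, "bob": 1},
    "13/Mar/2012": {"alice": 2},
}

# The number of unique users for a day is just the number of keys in the inner hash.
unique_users_today = len(all_days["13/Mar/2012"])
```

Keying by date first keeps the per-day count cheap, and the inner hash doubles as the user list with request counts.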
Now that we have valid lines, we can start populating this hash. If the current date already exists as a key, we increase the given user's request count. If it doesn't, we first instantiate the user/request_count hash for that date with a default value of zero, then count the request. That's actually the bulk of the work. After that, it's just a matter of counting the users for each day, gathering the users from the past 30 days into one list, and uniquing that list.
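Here's a minimal sketch of the populate-and-count step, again in Python and again under assumptions: the input is a list of (date, user) pairs, dates look like `12/Mar/2012`, and "past 30 days" means a 30-day window that includes the end date. The function names are hypothetical. A `defaultdict` stands in for the hash with a default of zero, and a set does the uniquing.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%b/%Y"  # assumed log date format, e.g. 12/Mar/2012


def build_all_days(records):
    """Build {date: {user: request_count}} from (date, user) pairs."""
    # Missing dates get a fresh inner dict; missing users start at zero.
    all_days = defaultdict(lambda: defaultdict(int))
    for date, user in records:
        all_days[date][user] += 1
    return all_days


def users_in_window(all_days, end_date, days=30):
    """Return the set of unique users seen in the `days` days ending on end_date."""
    end = datetime.strptime(end_date, DATE_FORMAT)
    start = end - timedelta(days=days - 1)  # window includes the end date
    users = set()
    for date, counts in all_days.items():
        if start <= datetime.strptime(date, DATE_FORMAT) <= end:
            users.update(counts)  # adds the usernames (dict keys); the set dedupes
    return users
```

The daily unique count is then `len(all_days[date])`, and the 30-day count is `len(users_in_window(all_days, date))`.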
If anyone has any comments on how I could make this better, I would welcome the feedback!