Log File Analysis must be part of each SEO professional’s instrument belt, however most SEOs have by no means carried out one. Which means most SEOs are lacking out on distinctive and invaluable insights that common crawling instruments simply can’t produce.
Let’s demystify Log File Analysis so it isn’t so intimidating. If you’re within the fantastic world of log recordsdata and what they’ll deliver to your website audits, this information is unquestionably for you.
What are Log Files?
Log Files are recordsdata containing detailed logs on who and what’s making requests to your web site server. Every time a bot makes a request to your website, knowledge (such because the time, date IP handle, person agent, and many others.) is saved on this log. This precious knowledge permits any SEO to discover out what Googlebot and different crawlers are doing in your website. Unlike common crawlings, equivalent to with the Screaming Frog SEO Spider, that is real-world knowledge — not an estimation of how your website is being crawled. It is an actual overview of how your website is being crawled.
Having this correct knowledge may help you establish areas of crawl finances waste, simply discover entry errors, perceive how your SEO efforts are affecting crawling and far, far more. The better part is that, typically, you are able to do this with easy spreadsheet software program.
In this information, we will likely be focussing on Excel to carry out Log File Analysis, however I’ll additionally talk about different instruments equivalent to Screaming Frog’s much less well-known Log File Analyser which may simply make the job a bit simpler and quicker by serving to you handle bigger knowledge units.
Note: proudly owning any software program apart from Excel is just not a requirement to observe this information or get your fingers soiled with Log Files.
How to Open Log Files
Rename .log to .csv
When you get a log file with a .log extension, it’s actually as simple as renaming the file extension .csv and opening the file in spreadsheet software program. Remember to set your working system to present file extensions if you would like to edit these.
How to open break up log recordsdata
Log recordsdata can are available both one huge log or a number of recordsdata, relying on the server configuration of your website. Some servers will use server load balancing to distribute visitors throughout a pool or farm of servers, inflicting log recordsdata to be break up up. The excellent news is that it is very easy to mix, and you should use one in every of these three strategies to mix them after which open them as regular:
- Use the command line in Windows by Shift + right-clicking within the folder containing your log recordsdata and deciding on “Run Powershell from here”
Then run the next command:
copy *.log mylogfiles.csv
You can now open mylogfile.csv and it’ll include all of your log knowledge.
Or if you’re a Mac person, first use the cd command to go to the listing of your log recordsdata:
Then, use the cat or concatenate command to be a part of up your recordsdata:
cat *.log > mylogfiles.csv
2) Using the free instrument, Log File Merge, mix all of the log recordsdata after which edit the file extension to .csv and open as regular.
three) Open the log recordsdata with the Screaming Frog Log File Analyser, which is so simple as dragging and dropping the log recordsdata:
(Please be aware: This step isn’t required if you’re utilizing Screaming Frog’s Log File Analyser)
Once you have got your log file open, you’re going to want to break up the cumbersome textual content in every cell into columns for simpler sorting later.
Excel’s Text to Column perform is useful right here, and is as simple as deciding on all of the stuffed cells (Ctrl / Cmd + A) and going to Excel > Data > Text to Columns and deciding on the “Delimited” choice, and the delimiter being a Space character.
Once you’ve separated this out, you may additionally need to type by time and date — you are able to do so within the Time and Date stamp column, generally separating the information with the “:” colon delimiter.
Your file ought to look related to the one beneath:
As talked about earlier than, don’t fear in case your log file doesn’t look precisely the identical — totally different log recordsdata have totally different codecs. As lengthy as you have got the essential knowledge there (time and date, URL, user-agent, and many others.) you’re good to go!
Understanding Log Files
Now that your log recordsdata are prepared for evaluation, we are able to dive in and begin to perceive our knowledge. There are many codecs that log recordsdata can take with a number of totally different knowledge factors, however they often embrace the next:
- Server IP
- Date and time
- Server request methodology (e.g. GET / POST)
- Requested URL
- HTTP standing code
More particulars on the frequent codecs could be discovered beneath for those who’re within the nitty gritty particulars:
- Apache and NGINX
- Amazon Elastic Load Balancing
- HA Proxy
How to rapidly reveal crawl finances waste
As a fast recap, Crawl Budget is the variety of pages a search engine crawls upon each go to of your website. Numerous elements have an effect on crawl finances, together with hyperlink fairness or area authority, website velocity, and extra. With Log File Analysis, we will likely be ready to see what kind of crawl finances your web site has and the place there are issues inflicting crawl finances to be wasted.
Ideally, we would like to give crawlers essentially the most environment friendly crawling expertise attainable. Crawling shouldn’t be wasted on low-value pages and URLs, and precedence pages (product pages for instance) shouldn’t have slower indexation and crawl charges as a result of a web site has so many lifeless weight pages. The title of the sport is crawl finances conservation, and with good crawl finances conversion comes higher natural search efficiency.
See crawled URLs by person agent
Seeing how steadily URLs of the location are being crawled can rapidly reveal the place search engines like google and yahoo are placing their time into crawling.
If you’re eager about seeing the conduct of a single person agent, that is simple as filtering out the related column in excel. In this case, with a WC3 format log file, I’m filtering the cs(User-Agent) column by Googlebot:
And then filtering the URI column to present the variety of instances Googlebot crawled the house web page of this instance website:
This is a quick manner of seeing if there are any drawback areas by URI stem for a singular user-agent. You can take this a step additional by wanting on the filtering choices for the URI stem column, which on this case is cs-uri-stem:
From this fundamental menu, we are able to see what URLs, together with useful resource recordsdata, are being crawled to rapidly establish any drawback URLs (parameterized URLs that shouldn’t be being crawled for instance).
You may do broader analyses with Pivot tables. To get the variety of instances a selected person agent has crawled a selected URL, choose the entire desk (Ctrl/cmd + A), go to Insert > Pivot Table after which use the next choices:
All we’re doing is filtering by User Agent, with the URL stems as rows, after which counting the variety of instances every User-agent happens.
With my instance log file, I acquired the next:
Then, to filter by particular User-Agent, I clicked the drop-down icon on the cell containing “(All),” and chosen Googlebot:
Understanding what totally different bots are crawling, how cellular bots are crawling in another way to desktop, and the place essentially the most crawling is going on may help you see instantly the place there’s crawl finances waste and what areas of the location want enchancment.
Find low-value add URLs
Crawl finances shouldn’t be wasted on Low value-add URLs, that are usually brought on by session IDs, infinite crawl areas, and faceted navigation.
To do that, return to your log file, and filter by URLs that include a “?” or query mark symbols from the URL column (containing the URL stem). To do that in Excel, bear in mind to use “~?” or tilde query mark, as proven beneath:
A single “?” or query mark, as acknowledged within the auto filter window, represents any single character, so including the tilde is like an escape character and makes certain to filter out the query mark image itself.
Isn’t that simple?
Find duplicate URLs
Duplicate URLs generally is a crawl finances waste and a giant SEO problem, however discovering them generally is a ache. URLs can generally have slight variants (equivalent to a trailing slash vs a non-trailing slash model of a URL).
Ultimately, the easiest way to discover duplicate URLs can also be the least enjoyable manner to accomplish that — you have got to type by website URL stem alphabetically and manually eyeball it.
One manner you’ll find trailing and non-trailing slash variations of the identical URL is to use the SUBSTITUTE perform in one other column and use it to take away all ahead slashes:
=SUBSTITUTE(C2, “/”, “”)
In my case, the goal cell is C2 because the stem knowledge is on the third column.
Then, use conditional formatting to establish duplicate values and spotlight them.
However, eyeballing is, sadly, one of the best methodology for now.
See the crawl frequency of subdirectories
Finding out which subdirectories are getting crawled essentially the most is one other fast manner to reveal crawl finances waste. Although bear in mind, simply because a consumer’s weblog has by no means earned a single backlink and solely will get three views a yr from the enterprise proprietor’s grandma doesn’t imply you need to take into account it crawl finances waste — inner linking construction must be constantly good all through the location and there could be a robust purpose for that content material from the consumer’s perspective.
To discover out crawl frequency by subdirectory degree, you have to to principally eyeball it however the next formulation may help:
The above formulation seems like a little bit of a doozy, however all it does is examine if there’s a trailing slash, and relying on the reply, rely the variety of trailing slashes and subtract both 2 or 1 from the quantity. This formulation may very well be shortened for those who take away all trailing slashes out of your URL listing utilizing the RIGHT formulation — however who has the time. What you’re left with is subdirectory rely (ranging from zero from as the primary subdirectory).
Replace C2 with the primary URL stem / URL cell after which copy the formulation down your whole listing to get it working.
Make certain you change all the C2s with the suitable beginning cell after which type the brand new subdirectory counting column by smallest to largest to get an excellent listing of folders in a logical order, or simply filter by subdirectory degree. For instance, as proven within the beneath screenshots:
The above picture is subdirectories sorted by degree.
The above picture is subdirectories sorted by depth.
If you’re not coping with numerous URLs, you would merely type the URLs by alphabetical order however then you definately gained’t get the subdirectory rely filtering which generally is a lot quicker for bigger websites.
See crawl frequency by content material kind
Finding out what content material is getting crawled, or if there are any content material varieties which are hogging crawl finances, is a superb examine to spot crawl finances waste. Frequent crawling on pointless or low precedence CSS and JS recordsdata, or how crawling is going on on pictures if you’re attempting to optimize for picture search, can simply be noticed with this tactic.
In Excel, seeing crawl frequency by content material kind is as simple as filtering by URL or URI stem utilizing the Ends With filtering choice.
Quick Tip: You may use the “Does Not End With” filter and use a .html extension to see how non-HTML web page recordsdata are being crawled — all the time value checking in case of crawl finances waste on pointless js or css recordsdata, and even pictures and picture variations ( you WordPress). Also, bear in mind in case you have a website with trailing and non-trailing slash URLs to take that into consideration with the “or” operator with filtering.
Spying on bots: Understand website crawl conduct
Log File Analysis permits us to perceive how bots behave by giving us an concept of how they prioritize. How do totally different bots behave in several conditions? With this data, you can’t solely deepen your understanding of SEO and crawling, but in addition provide you with an enormous leap in understanding the effectiveness of your website structure.
See most and least crawled URLs
This technique has been touched up beforehand with seeing crawled URLs by user-agent, but it surely’s even quicker.
In Excel, choose a cell in your desk after which click on Insert > Pivot Table, ensure that the choice incorporates the mandatory columns (on this case, the URL or URI stem and the user-agent) and click on OK.
Once you have got your pivot desk created, set the rows to the URL or URI stem, and the summed worth because the user-agent.
From there, you possibly can right-click within the user-agent column and type the URLs from largest to smallest by crawl rely:
Now you’ll have an awesome desk to make charts from or rapidly overview and search for any problematic areas:
A query to ask your self when reviewing this knowledge is: Are the pages you or the consumer would need being crawled? How typically? Frequent crawling doesn’t essentially imply higher outcomes, however it may be a sign as to what Google and different content material user-agents prioritize most.
Crawl frequency per day, week, or month
Checking the crawling exercise to establish points the place there was lack of visibility round a time period, after a Google replace or in an emergency can inform you the place the issue could be. This is so simple as deciding on the “date” column, ensuring the column is within the “date” format kind, after which utilizing the date filtering choices on the date column. If you’re wanting to analyze an entire week, simply choose the corresponding days with the filtering choices out there.
Crawl frequency by directive
Understanding what directives are being adopted (for example, if you’re utilizing a disallow or perhaps a no-index directive in robots.txt) by Google is important to any SEO audit or marketing campaign. If a website is utilizing disallows with faceted navigation URLs, for instance, you’ll need to ensure that these are being obeyed. If they aren’t, advocate a greater answer equivalent to on-page directives like meta robots tags.
To see crawl frequency by directive, you’ll want to mix a crawl report along with your log file evaluation.
(Warning: We’re going to be utilizing VLOOKUP, but it surely’s actually not as difficult as folks make it out to be)
To get the mixed knowledge, do the next:
- Get the crawl out of your website utilizing your favourite crawling software program. I could be biased, however I’m a giant fan of the Screaming Frog SEO Spider, so I’m going to use that.
If you’re additionally utilizing the spider, observe the steps verbatim, however in any other case, make your individual name to get the identical outcomes.
- Export the Internal HTML report from the SEO Spider (Internal Tab > “Filter: HTML”) and open up the “internal_all.xlsx” file.
From there, you possibly can filter the “Indexability Status” column and take away all clean cells. To do that, use the “does not contain” filter and simply go away it clean. You may add the “and” operator and filter out redirected URLs by making the filter worth equal “does not contain → “Redirected” as proven beneath:
This will present you canonicalized, no-index by meta robots and canonicalized URLs.
- Copy this new desk out (with simply the Address and Indexability Status columns) and paste it in one other sheet of your log file evaluation export.
- Now for some VLOOKUP magic. First, we’d like to ensure that the URI or URL column knowledge is in the identical format because the crawl knowledge.
Log Files don’t typically have the basis area or protocol within the URL, so we both want to take away the top of the URL utilizing “Find and Replace” in our newly made sheet, or make a brand new column in your log file evaluation sheet append the protocol and root area to the URI stem. I want this methodology as a result of then you possibly can rapidly copy and paste a URL that you’re seeing issues with and have a look. However, in case you have a large log file, it’s in all probability quite a bit much less CPU intensive with the “Find and Replace” methodology.
To get your full URLs, use the next formulation however with the URL discipline modified to no matter website you’re analyzing (and ensure the protocol is appropriate as nicely). You’ll additionally need to change D2 to the primary cell of your URL column
https://www.example.com”&D2 Drag down the formulation to the tip of your Log file desk and get a pleasant listing of full URLs:
- Now, create one other column and name it “Indexability Status”. In the primary cell, use a VLOOKUP related to the next: =VLOOKUP(E2,CrawlSheet!A$1:B$1128,2,FALSE). Replace E2 with the primary cell of you “Full URL” column, then make the lookup desk into your new. crawl sheet. Remember to sue the greenback indicators in order that the lookup desk would not change as you. apply the formulation to additional roles. Then, choose the right column (1 can be the primary column of the index desk, so quantity 2 is the one we’re after). Use the FALSE vary lookup mode for actual matching. Now you have got a pleasant tidy listing of URLs and their indexability standing matched with crawl knowledge:
Crawl frequency by depth and inner hyperlinks
This evaluation permits us to see how a website’s structure is performing by way of crawl finances and crawlability. The primary goal is to see in case you have way more URLs than you do requests — and for those who do then you have got an issue. Bots shouldn’t be “giving up” on crawling your whole website and never discovering necessary content material or losing crawl finances on content material that’s not necessary.
Tip: It can also be value utilizing a crawl visualization instrument alongside this evaluation to see the general structure of the location and see the place there are “off-shoots” or pages with poor inner linking.
To get this all-important knowledge, do the next:
- Crawl your website along with your most well-liked crawling instrument and export whichever report has each the press depth and variety of inner hyperlinks with every URL.
In my case, I’m utilizing the Screaming Frog SEO Spider, going exporting the Internal report:
- Use a VLOOKUP to match your URL with the Crawl Depth column and the variety of Inlinks, which offers you one thing like this:
- Depending on the kind of knowledge you need to see, you may want to filter out solely URLs returning a 200 response code at this level or make them filterable choices within the pivot desk we create later. If you’re checking an e-commerce website, you may want to focus solely on product URLs, or for those who’re optimizing crawling of pictures you possibly can filter out by file kind by filtering the URI column of your log file utilizing the “Content-Type” column of your crawl export and making an choice to filter with a pivot desk. As with all of those checks, you have got loads of choices!
- Using a pivot desk, now you can analyze crawl fee by crawl depth (filtering by the actual bot on this case) with the next choices:
To get one thing like the next:
Better knowledge than Search Console? Identifying crawl points
Search Console could be a go-to for each SEO, but it surely definitely has flaws. Historical knowledge is more durable to get, and there are limits on the variety of rows you possibly can view (presently of writing it’s 1000). But, with Log File Analysis, the sky’s the restrict. With the next checks, we’re going to be found crawl and response errors to give your website a full well being examine.
Discover Crawl Errors
An apparent and fast examine to add to your arsenal, all you have got to do is filter the standing column of your log file (in my case “sc-status” with a W3C log file kind) for 4xx and 5xx errors:
Find inconsistent server responses
A specific URL could have various server responses over time, which may both be regular conduct, equivalent to when a damaged hyperlink has been mounted or an indication there’s a critical server problem occurring equivalent to when heavy visitors to your website causes much more inner server errors and is affecting your website’s crawlability.
Analyzing server responses is as simple as filtering by URL and by Date:
Alternatively, if you would like to rapidly see how a URL is various in response code, you should use a pivot desk with the rows set to the URL, the columns set to the response codes and counting the variety of instances a URL has produced that response code. To obtain this setup create a pivot desk with the next settings:
This will produce the next:
As you possibly can see within the above desk, you possibly can clearly see “/inconcistent.html” (highlighted within the crimson field) has various response codes.
View Errors by Subdirectory
To discover which subdirectories are producing essentially the most issues, we simply want to do some easy URL filtering. Filter out the URI column (in my case “cs-uri-stem”) and use the “contains” filtering choice to choose a selected subdirectory and any pages inside that subdirectory (with the wildcard *):
For me, I checked out the weblog subdirectory, and this produced the next:
View Errors by User Agent
Finding which bots are struggling could be helpful for quite a few causes together with seeing the variations in web site efficiency for cellular and desktop bots, or which search engines like google and yahoo are finest ready to crawl extra of your website.
You may need to see which specific URLs are inflicting points with a selected bot. The simplest way to do that is with a pivot desk that permits for filtering the variety of instances a selected response code happens per URI. To obtain this make a pivot desk with the next settings:
From there, you possibly can filter by your chosen bot and response code kind, equivalent to picture beneath, the place I am filtering for Googlebot desktop to hunt down 404 errors:
Alternatively, you can even use a pivot desk to see what number of instances a selected bot produces totally different response codes as an entire by making a pivot desk that filters by bot, counts by URI prevalence, and makes use of response codes as rows. To obtain this use the settings beneath:
For instance, within the pivot desk (beneath), I’m what number of of every response code Googlebot is receiving:
Diagnose on-page issues
Websites want to be designed not only for people, however for bots. Pages shouldn’t be sluggish loading or be an enormous obtain, and with log file evaluation, you possibly can see each of those metrics per URL from a bot’s perspective.
Find sluggish & massive pages
While you possibly can type your log file by the “time taken” or “loading time” column from largest to smallest to discover the slowest loading pages, it’s higher to take a look at the typical load time per URL as there may very well be different elements which may have contributed to a sluggish request apart from the online web page’s precise velocity.
To do that, create a pivot desk with the rows set to the URI stem or URL and the summed worth set to the time taken to load or load time:
Then utilizing the drop-down arrow, on this case, the place it says “Sum of time-taken” and go to “Value Field Settings”:
In the brand new window, choose “Average” and also you’re all set:
Now you need to have one thing related to the next if you type the URI stems by largest to smallest and common time taken:
Find massive pages
You can now add the obtain dimension column (in my case “sc-bytes”) utilizing the settings proven beneath. Remember that the set the dimensions to the typical or sum relying on what you prefer to to see. For me, I’ve accomplished the typical:
And you need to get one thing related to the next:
Bot conduct: Verifying and analyzing bots
The finest and simplest way to perceive bot and crawl conduct is with log file evaluation as you’re once more getting real-world knowledge, and it’s quite a bit much less problem than different strategies.
Find un-crawled URLs
Simply take the crawl of your web site along with your instrument of selection, after which take your log file an examine the URLs to discover distinctive paths. You can do that with the “Remove Duplicates” function of Excel or conditional formatting, though the previous is quite a bit much less CPU intensive particularly for bigger log recordsdata. Easy!
Identify spam bots
Unnecessary server pressure from spam and spoof bots is definitely recognized with log recordsdata and a few fundamental command line operators. Most requests may even have an IP related to it, so utilizing your IP column (in my case, it’s titled “c-ip” in a W3C format log), take away all duplicates to discover every particular person requesting IP.
From there, you need to observe the method outlined in Google’s doc for verifying IPs (be aware: For Windows customers, use the nslookup command):
Or, for those who’re verifying a bing bot, use their useful instrument:
Conclusion: Log Files Analysis — not as scary because it sounds
With some easy instruments at your disposal, you possibly can dive deep into how Googlebot behaves. When you perceive how a web site handles crawling, you possibly can diagnose extra issues than you possibly can chew — however the actual energy of Log File Analysis lies in having the ability to check your theories about Googlebot and increasing the above methods to collect your individual insights and revelations.
What theories would you check utilizing log file evaluation? What insights may you collect from log recordsdata apart from those listed above? Let me know within the feedback beneath.