How do I create a robots.txt file for my forum?

  • Affected Version
    WoltLab Suite 5.4

    Hello,

    According to a reply in a recent WoltLab ticket I opened for this issue, robots.txt files are not created by default;

    additionally, I was told "it is easy to create such a file".

    Maybe so for individuals who are well acquainted with coding, creating, and uploading files similar to this.

    However it is not "easy" for me as I have NO experience with this type of work.


    Reasons for creating the robots.txt file:

    According to information I have looked up regarding Google Search Console; SEO; and robots.txt files,

    if there is no such file uploaded for an individual's forum (or website) this can allow all information on one's forum site to be "searched" by robots.

    Additionally, having NO robots.txt file also leads to many "404 Errors".

    This is what is currently occurring with my forum, e.g., many "404 errors".

    Question: Is there anyone on this forum who can help me with making a Robots.txt file?

    I understand this file must be uploaded to the root of my domain. I think I can handle that much. But first I need the file itself to be created.

    DJ


  • You can create a plain text file for that.

    For the content, you could take inspiration from WoltLab's own file: https://www.woltlab.com/robots.txt

    Basically, it is a list of blocked pages that should not get indexed or crawled. Those pages may be inaccessible to spiders anyway, or they exist only for creating content, and spiders don't create content.

  • "You can create a plain text file for that."

    Although your reply is appreciated, I am not sure exactly what you mean by "... create a plain text file ...".

    The URL above doesn't do me much good because I am back in the position I was in from the get-go, i.e., not knowing how to create and implement the needed file(s).

    Yes, there is a list of sites that should not get indexed. Google explains the difference between ones that should be blocked and those that should not be blocked. And a website I visited recently lists those as well. That is all well and good for those individuals who KNOW how to implement this information correctly into their domain.

    I could of course hire a firm / company that does this type of work .... it would only cost a few hundred dollars! Some want thousands (yearly contract).

    Well of course I am not in the monetary position to engage such expensive help.

    Perhaps there will still be someone on Woltlab Forum who will see this Thread / Post and come forth with some suggestions / assistance.

    I am not going anywhere ....

  • robots.txt is just that: a plain text file that most crawlers read before crawling your site (there are still some bad bots out there that operate under well-known names like Majestic12 et al.) to find out what they should and shouldn't be crawling. The formatting is fairly simple, and we can use WoltLab's as an example!

    Code
    User-agent: *
    
    Disallow: /combined-tagged/
    Disallow: /members-list/
    Disallow: /search-result/
    Disallow: /tagged/
    Disallow: /trophy/
    Disallow: /user/

    User-agent here is used with the wildcard * to tell all crawlers (though again, not all crawlers obey this or even robots.txt on the whole) that they shouldn't be crawling these specific pages (via the Disallow directives that follow). Often you'll want to stop crawlers from visiting parts of your site that only users with the correct permissions can access (since WSC will show them a permissions error anyway, you may as well avoid polluting the index with those), so you supply a robots.txt with those pages disallowed for all crawlers.

    Once you have a robots.txt, just put it in your site root folder. That's it.
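
    If you want to see how a crawler would read those rules, here is a minimal sketch using Python's standard urllib.robotparser module. The two paths being checked are made up for illustration, and I have left out the blank second line because, as far as I can tell, Python's parser treats a blank line as the end of a group.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, written as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /combined-tagged/
Disallow: /members-list/
Disallow: /search-result/
Disallow: /tagged/
Disallow: /trophy/
Disallow: /user/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths: one under a disallowed prefix, one that is not.
for path in ("/user/123-example/", "/board/1-general/"):
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{path}: {verdict}")
```

    Each Disallow line is a path prefix, so anything whose URL path starts with /user/ is off limits to well-behaved crawlers, while everything not listed stays crawlable.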

  • Hello Nebulon Ranger

    So to make sure I have this right:

    In the Code Table you show above as numbers 2 thru 8, would I put in there what Google indicates should be included in the robots.txt file? OR, just use the items you have indicated??

    Why have you left number 2 empty?

    As for uploading the robots.txt file to the root folder, I know where to go in the cPanel.

    DJ

  • Why have you left number 2 empty?

    I directly copied from WoltLab's own robots.txt and it just so happens they left line 2 blank. Nothing more, nothing less.


    In the Code Table you show above as numbers 2 thru 8, would I put in there what Google indicates should be included in the robots.txt file? OR, just use the items you have indicated??

    For all lines after the initial User-agent directive, you would add, one per line, a Disallow for each page you don't want Google, Bing et al. indexing. This would include any pages related to user account management (since not only can a spider not post content, they also can't log in or manage their nonexistent account!), creation of content and any privileged sections you don't want them to just be served a permissions error for trying to index (in my case, the Real Talk board on my upcoming forum isn't viewable by guests, so I'd add it to the list of Disallows).

    robots.txt is fairly site-specific--there are pages that are near-universally disallowed, like the aforementioned account management and administration pages, but for things like content it largely depends on what you, the site owner, want spiders to index.
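
    Since the file is just one Disallow line per path, you can also generate it from a list. This is only a rough sketch: build_robots_txt is a helper name I made up, and the last path in the list is a hypothetical stand-in for a private board, so swap in your own paths.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt content with one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallowed_paths:
        # Paths in robots.txt are relative to the site root and start with "/".
        if not path.startswith("/"):
            path = "/" + path
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


# Example list; "/board/99-private-board/" is a hypothetical path.
paths = ["/user/", "/members-list/", "/search-result/", "/board/99-private-board/"]
content = build_robots_txt(paths)
print(content)
```

    Save the printed text as robots.txt (plain text, nothing else in the file) and upload it to your site root.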

  • The generator will make a file with this formatting anyway, so if you're feeling lazy you could use it, but you don't have to.

    OK, thanks! When I make the file and upload it to the root folder I will post here if successful - or not.

    One other thing: Since you seem to be quite knowledgeable about this stuff, what do you know about "DOM"?

    One website that I logged onto mentioned that any website or forum should not have a DOM total above 1,500. On my site, it is 4,000.

    Is that something I should be concerned about, or is it these websites just trying to drum up business???

  • DOM is the Document Object Model--it includes all HTML and style (and some inferred elements, such as the numerous divs used under the hood by the progress HTML element) used on your website, whether dynamically-generated by JavaScript or not. Generally speaking, any source that tries to give you a hard limit on the size of the DOM like this is either misinformed or operating on information that's years out-of-date. The DOM can be as big as you need it to be as long as you don't have duplicate information in it and your site is still responsive for the majority of its users.

  • Hello Nebulon Ranger,

    Thanks for the info on DOM!

    Regarding robots text Files:

    I created the robots.txt file in NotePad on my computer, saved it, and then uploaded the file to what I believe was the correct place.

    However when I ran a "check" of this (logged into my account at Google Search Console), the message that came up indicated there is no such file.

    Is there another "check" (other than logging into my Google Search Console Account) which I can make to see if I actually do have the robots.txt file uploaded properly to my domain root, or not?

    DJ

  • The robots.txt must be in the root directory of your website, so if you have the suite installed at the base of your website, stick it there.

    I loaded the site in your signature and get directed to /app/, so also try putting a copy there.
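
    For a check that doesn't depend on Search Console, you can simply request the file yourself. Crawlers ask for /robots.txt at the root of the host, whatever page they start from, so that is the URL to test. Below is a minimal sketch using only Python's standard library; forum.example.com is a placeholder, and robots_url and check_robots are names I made up.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site_url: str) -> str:
    """Return the robots.txt URL at the root of the host in site_url."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def check_robots(site_url: str, timeout: float = 10.0) -> str:
    """Fetch robots.txt and describe the result (urlopen follows redirects)."""
    url = robots_url(site_url)
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # e.g. 404 when the file is missing or sits in the wrong folder
        return f"{url} -> HTTP {err.code}"
    except URLError as err:
        return f"{url} -> could not connect: {err.reason}"
    return f"{url} -> OK, {len(body.splitlines())} lines"


# The path part of the address is ignored; only the host matters.
print(robots_url("https://forum.example.com/app/"))
# prints https://forum.example.com/robots.txt
```

    Calling check_robots("https://your-domain/") with your real address should report OK once the file is in the right folder. Typing the same robots.txt URL into a browser works just as well.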

  • I had uploaded the robots.txt file to the wrong folder. It is the correct folder now.

    Somewhere I remember there is a way to "test" this file, but I didn't save it on my computer, and I can't remember how or where I found the "test".

    The information on Google Search Console site states that their testing link for this type of file does not work with domains. Anyways, I know the file is in the correct folder now on my domain because I also see the particular file therein which was uploaded to create my Google Search Console account.

    As for: "I loaded the site in your signature and get directed to /app/, so also try putting a copy there." -

    I have in my mind always questioned whether or not there was a mistake made (or incorrect procedure initiated) when my domain was originally uploaded; because why should there be 4wardxposure2.com/ AND 4wardxposure2.com/app ?

    I am not the person who established / uploaded my present forum - it was uploaded to the internet by an individual who once was a member of this Woltlab Forum.

    Hey Nebulon Ranger! :)

    I want you to know I really appreciate the time and effort you have extended to me with this particular issue of establishing a robots.txt file!

    Your detailed answers and information were definitely a big help to me!

    DJ

  • Hello Nebulon Ranger,

    I know you have a lot going on with your upcoming website ......

    But just to let you know regarding: "I loaded the site in your signature and get directed to /app/, so also try putting a copy there"

    I logged into my Domain Management and went thru every file therein. There is nothing in there that even remotely suggests there is anything other than 4wardxposure2 (dot) com - 4wardxposure2 (dot) com / app / doesn't exist.

    So why is there a "re-direct" from the former to the latter URL?

    Also, in my Google Account, e.g. "Search Console", there is a place to "test" both URL's.

    Among other things, the connectivity speed, etc., is a lot better for 4wardxposure2 / app than with 4wardxposure2 / !

    Any comment on why this is? Or perhaps someone else has an explanation???

  • It's likely that there's a redirect in place from the base directory to /app/, or that Core got installed there, but it's hard to know for sure without seeing the directory structure, which I obviously am not entitled to see. This'd also explain why testing directly to /app/ shows higher performance, since you're avoiding the redirect penalty.

  • Hello! Nebulon Ranger

    Do you think I should contact my Hosting provider?

    OR should I open a Ticket with Woltlab instead?

    My hosting provider has always been good about helping me with issues that involve my Domain, however I am a bit hesitant to get them involved because if they do something to the Domain Root - in trying to solve my problem - and the result turns out to be my Forum becomes inactive, I will be "up the creek" (SOL).

    The only other possibility (?) would be to start over from scratch and reload the current Version of Woltlab Software that I have currently.

    But I don't like this idea because it probably would require me hiring someone to do the work.

    OR: Maybe just forget this whole issue and just let the status quo stay as such.

    IF YOU had this same problem, what would you do? (I trust your judgement).

    Thanks,

    DJ

  • Honestly, if it doesn't affect response time for users in any major way then I'd just leave it. Google and other search engines are often anal about minor things that ultimately don't impact their spiders anyway.

  • Hello Nebulon Ranger

    Thanks for your answer - appreciated! Agree with you!!

    For me, I have always been skeptical about the value, practicality, and usefulness of the Google Search Console.

    Good luck on your website. Hope you do really great with it. :)

    DJ
