antiX oldforums archive (you can browse online, or clone, or download a copy)


  • This topic has 19 replies, 10 voices, and was last updated Jun 30, 11:39 am by mroot.
  • #2544
    Anonymous

      updated Dec 26, 2017

      A searchable copy of the oldforums archive content is now available.

      To search, use the navigation link provided in the page header bar (“Forum” … “Quick News” …):
      click “Forum” and choose “Old Forum Archive” from the dropdown menu

      You can browse the archive contents here: https://antixlinux.com/archive


      =================

      (below, the links / screenshots from the earlier announcement are still valid)

      You can go here to browse the archive: antiX oldforums archive (2007–2017)

      To download a copy for offline browsing, visit https://github.com/antix-skidoo/antix-skidoo.github.io
      and click the “Clone or Download” button.
      The zipfile is 27 MB; the expanded content occupies 112 MB on disk.
      You can change the styling of the archived html pages by editing the css rules within aaa_oldforums.css.
      Within the downloaded zipfile, the index page (homepage) is archive/index.html
      -=-
      Tip: you can gain full search capability (even wildcard queries) across the content of your downloaded copy
      by installing the Debian package “recoll” (and reading its docs) and creating a search index of ~/path/to/extracted_files/archive
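
      For example, a minimal sketch of building that index non-interactively, assuming recoll is installed and
      the archive was extracted to ~/antix-archive (the config dir and paths here are placeholders, adjust to suit):

      #!/usr/bin/env python3
      # Sketch: build a dedicated recoll index for the extracted archive.
      # The paths are assumptions -- point them at wherever you unzipped the fileset.
      import os
      import subprocess

      confdir = os.path.expanduser("~/.recoll-antix")         # separate config, keeps your main index clean
      topdir = os.path.expanduser("~/antix-archive/archive")  # the extracted archive content

      os.makedirs(confdir, exist_ok=True)
      with open(os.path.join(confdir, "recoll.conf"), "w") as f:
          f.write("topdirs = " + topdir + "\n")               # tell recoll which tree to index

      # recollindex -c <confdir> builds/updates the index for that config
      subprocess.run(["recollindex", "-c", confdir], check=True)

      Searches can then be run against that index with the recoll GUI (recoll -c ~/.recoll-antix) or the recollq command-line tool.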

      [screenshots 1–4 from the earlier announcement]

      notes:

      — the median topic page size is only 9 KB (gobs of inline scripts and stylesheets have been removed)

      — joinDate and numTotalPosts ~~ these details are present, but (to minimize clutter) are currently suppressed from display via css rules

      — the archive scrape (er, mirror) operation did NOT capture attached files, nor uploaded/embedded attached images

      — each of the pages calls 2 external files: aaa_oldforums.css and aaa_oldforums.js (within your copy of the archive fileset, you can modify these to suit)

      #2550
      Forum Admin
      BitJam

        Holy cow! This is incredible!

        Context is worth 80 IQ points -- Alan Kay

        #2551
        Moderator
        caprea

          Really cool! Thank you!

          #2552
          Forum Admin
          anticapitalista

            W.O.W! Brilliant work, skidoo.

            Philosophers have interpreted the world in many ways; the point is to change it.

            antiX with runit - leaner and meaner.

            #2554
            Forum Admin
            dolphin_oracle

              That is pretty cool!

              #2562
              Forum Admin
              Dave

                Skidoo, is this an html crawl archive ~~ as in recursively crawling the links in a web browser, or using wget?

                I am wondering if we could work up a script to parse the html into a comma-separated file, then try to import it into the current forum sql database.

                On second thought, maybe it is better to host this alongside the forum under a separate archive link and make a decent search page for the archive.

                Computers are like air conditioners. They work fine until you start opening Windows. ~Author Unknown

                #2565
                Anonymous

                  Dave, I extracted links from the 280 or so topic list pages using the firefox addon “Link Gopher”,
                  then fed the pages to “httrack” (a crawler, not a scraper), instructing it to “get separated (sic) pages”.

                  python + scrapy (an xpath scraper) library could extract (from the original pages, or from the pages in the archive set):
                  author.userid
                  author.name
                  author.join_date (00 Feb 2000)
                  author.total_num_posts
                  subforum.id
                  subforum.name
                  subforum.total_topics
                  subforum.total_posts
                  subforum.last_post_date (00 Feb 0000, 00:00)
                  post.id
                  post.datetime (00-0000-00T00:00)
                  post.num_within_topic
                  post.content
                  post.topic_id
                  topic.id
                  topic.title
                  topic.startedby_userid
                  topic.num_posts
                  topic.last_post_datetime (2008-00-00T00:00)

                  A script could sanitize these, then store ’em to your target db engine + schema ~~ a rough sketch of the extraction step follows.
                  Importing into an existing database would be difficult (nearly impossible), due to collisions (user.id, topic.id, post.id)
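
                  For illustration, a minimal sketch using lxml’s xpath support in place of scrapy; the selectors and class names are hypothetical, since they depend on the archive’s real markup:

                  #!/usr/bin/env python3
                  # Sketch: walk the extracted archive, pull a few per-post fields via xpath,
                  # and write them to CSV for later sanitizing/import. The selectors are
                  # placeholders -- inspect the archived pages' markup and adjust.
                  import csv
                  import pathlib
                  from lxml import html

                  ARCHIVE = pathlib.Path("archive")   # root of the extracted fileset

                  rows = []
                  for page in ARCHIVE.rglob("*.html"):
                      tree = html.parse(str(page))
                      for post in tree.xpath('//div[@class="post"]'):   # hypothetical selector
                          rows.append({
                              "post.id":       post.get("id"),
                              "author.name":   "".join(post.xpath('.//a[@class="author"]/text()')),
                              "post.datetime": "".join(post.xpath('.//span[@class="date"]/text()')),
                              "post.content":  "".join(post.xpath('.//div[@class="body"]//text()')).strip(),
                          })

                  if rows:
                      with open("posts.csv", "w", newline="") as f:
                          writer = csv.DictWriter(f, fieldnames=list(rows[0]))
                          writer.writeheader()
                          writer.writerows(rows)

                  The remaining fields in the list above (topic.id, subforum.name, …) would extend the same pattern, one xpath per field.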

                  #2566
                  Anonymous

                    Excellent work skidoo.
                    I used antiX before, just wasn’t a member of the forums.
                    I used antiX (circa 15) as a rescue disk on my friends’ windoze xp and vista pc computers ~~
                    I’ll have to dig the old disk out and take a look.

                    #2583
                    Member
                    watsoccurring

                      That is so useful, many thanks skidoo.

                      #2610
                      Forum Admin
                      Dave

                        @skidoo there likely will be a collision problem, though I think most id’s are auto-incrementing, so they could be left blank to let the database assign fresh ones. User ID would be a big problem; we would probably need to make an “archive” user and set that user ID as the default in the sql (a rough sketch below).
                        That being said, it would be difficult, so maybe we could work out a search index on the website and use the extraction as a static archive that can be linked to in the forum, like the faq/user manual?
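
                        For illustration only ~~ sqlite3 stands in for the real forum database here, and the table/column names are made up:

                        #!/usr/bin/env python3
                        # Sketch: import archived posts while dodging id collisions.
                        # Old post ids are dropped so the db assigns fresh auto-increment ids,
                        # and every imported post is attributed to a single "archive" user.
                        import sqlite3

                        conn = sqlite3.connect("forum.db")
                        cur = conn.cursor()
                        cur.execute("""CREATE TABLE IF NOT EXISTS posts (
                                           id INTEGER PRIMARY KEY AUTOINCREMENT,
                                           user_id INTEGER NOT NULL,
                                           content TEXT)""")

                        ARCHIVE_USER_ID = 9999   # the dedicated "archive" user (made-up id)

                        old_posts = [            # (old_id, content) pairs extracted from the archive
                            (2544, "a searchable copy of the oldforums archive content ..."),
                        ]
                        for old_id, content in old_posts:
                            # id column omitted -> the db assigns the next free id, so no collision
                            cur.execute("INSERT INTO posts (user_id, content) VALUES (?, ?)",
                                        (ARCHIVE_USER_ID, content))
                        conn.commit()

                        In practice you’d also keep an old-id → new-id map, so cross-links between archived posts could be rewritten.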

                        Computers are like air conditioners. They work fine until you start opening Windows. ~Author Unknown

                        #4228
                        Moderator
                        BobC

                          I just want to say what a GREAT thing that was to do…

                          #4238
                          Forum Admin
                          Dave

                            Does the archive appear fine under the forum link and does the content display properly?
                            I tried embedding it into the main antiX site.
                            Archive
                            Unwrapped content:
                            https://antixlinux.com/forum-archive

                            Perhaps a search function could be added later.
                            Added: duckduckgo, but it has not been crawled yet.
                            Anyone who knows how to submit a url to duckduckgo, feel free to add antixlinux.com/forum-archive/


                            Computers are like air conditioners. They work fine until you start opening Windows. ~Author Unknown

                            #4246
                            Anonymous

                              The newly-added navigation link (and links in your post) work. Search via DDG does not find any results.

                              DDG does not accept crawl requests, but they utilize the searchindex of yandex.com… so I created a yandex webmaster account and submitted the archive index page URL.

                              When I later found (still) no results returned by DDG, I returned to yandex and read (realized):
                              they don’t accept “crawl requests” (sigh). What they provide is a way to request indexing of specific URLs, and you must paste each URL into a submission box ~~ max 100 per day. That is not viable; the archive comprises 7,000+ URLs.

                              #4247
                              Forum Admin
                              Dave

                                Bummer. Maybe they will crawl further based on one url being submitted? I see that the new forum and website show up if you use them in a site: search specification.

                                Edit:
                                Better yet: I can make a webpage holding nothing but the ls of the archive (sketch below). If we submit that one page, would they not index all the urls within?
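
                                A minimal sketch of generating such a page ~~ assuming the archive fileset sits in ./archive, and using the forum-archive base url from above:

                                #!/usr/bin/env python3
                                # Sketch: emit one html page linking every archived file, usable as a
                                # poor man's sitemap to hand to a crawler. Paths/url are assumptions.
                                import pathlib

                                BASE_URL = "https://antixlinux.com/forum-archive"
                                archive = pathlib.Path("archive")

                                links = [
                                    '<li><a href="%s/%s">%s</a></li>'
                                    % (BASE_URL, p.relative_to(archive), p.relative_to(archive))
                                    for p in sorted(archive.rglob("*.html"))
                                ]
                                page = "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"
                                pathlib.Path("archive-index.html").write_text(page)

                                Submitting just archive-index.html would then expose every archived url to the crawler from a single page.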


                                Computers are like air conditioners. They work fine until you start opening Windows. ~Author Unknown

                                #4250
                                Anonymous

                                  If we submit that one page would they not index all urls within?

                                  As mentioned in my prior post, I’m convinced yandex is a dead-end.
                                  This https://www.webnots.com/how-to-submit-your-site-to-yandex/ reads like an accurate walkthrough of the steps I had performed.
                                  Different from my recollection, the article/screenshot indicates “max 20 per day” (vs 100).
                                  I’m sending you a message containing the yandex webmaster account login details. Maybe you’ll find something, some “trick”, that I’ve missed.
