Pywikibot

Revision as of 05:25, 29 January 2018

Setting this up is confusing as hell, mainly because the documentation is lacking.

Getting, Configuring, Installing

I have the pywikibot software set up with two remotes: one official (Wikimedia gerrit), and one unofficial (my own git repo).

Link to pywikibot on Wikimedia Foundation's gerrit: https://gerrit.wikimedia.org/r/pywikibot/core.git

Link to pywikibot on git.charlesreid1.com: https://charlesreid1.com:3000/wiki/pywikibot

Wikimedia gerrit

Note that the official pywikibot repo is also mirrored on GitHub: https://github.com/wikimedia/pywikibot-core/

Start by checking it out:

$ git clone https://gerrit.wikimedia.org/r/pywikibot/core.git pywikibot
$ cd pywikibot

Install the Python dependencies with pip:

$ pip install -r requirements.txt

Update git submodules:

$ git submodule update --init

Add a custom family file to the big directory of family files:

$ ls pywikibot/families
...
wikivoyage_family.py
wiktionary_family.py
wowwiki_family.py

This is where you will put your custom family file. Here's what the custom family file looks like:

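from pywikibot import family

# Minimal family definition for a single-language wiki.
class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        # Family name; should match the <name>_family.py file name.
        self.name = 'charlesreid1'
        # Map each language code to the hostname that serves it.
        self.langs = {
            'en': 'charlesreid1.com',
        }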

Save this as pywikibot/families/charlesreid1_family.py, with the path taken relative to the directory where you checked out the git repository (named pywikibot in the clone command above).

Once your user-config.py names this family and your bot account, you should be able to log in to the wiki as your bot:

$ python pwb.py login
Password for user Bleep bloop on charlesreid1:en (no characters will be shown):

Logging in to charlesreid1:en as Bleep bloop
WARNING: /Users/charles/codes/pywikibot/pywikibot/tools/__init__.py:1717: UserWarning: File /Users/charles/codes/pywikibot/pywikibot.lwp had 644 mode; converted to 600 mode.
Logged in on charlesreid1:en as Bleep bloop.
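
The login above only works if pywikibot already knows which family, language, and account to use. As far as I can tell, those settings live in a user-config.py file at the top of the checkout. Here is a minimal sketch that renders such a file, assuming the family name and bot account from this page. The helper name render_user_config is hypothetical and not part of pywikibot.

```python
# Sketch: render a minimal user-config.py for the custom family above.
# render_user_config is a hypothetical helper, not part of pywikibot.


def render_user_config(family, lang, username):
    """Return the text of a minimal user-config.py."""
    lines = [
        "# -*- coding: utf-8 -*-",
        # Which family file to load (charlesreid1_family.py here).
        "family = {!r}".format(family),
        # Default language code within that family.
        "mylang = {!r}".format(lang),
        # Account the bot logs in as on that family and language.
        "usernames[{!r}][{!r}] = {!r}".format(family, lang, username),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_user_config("charlesreid1", "en", "Bleep bloop"))
```

Write the printed text to user-config.py in the checkout directory. Pywikibot also ships a generate_user_files script that asks for the same values interactively.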

git.charlesreid1.com

To push changes to the copy of pywikibot on git.charlesreid1.com, I added it to the repo as a second remote:

$ git remote add cmr https://charlesreid1.com:3000/wiki/pywikibot
$ git push cmr master

Running Simple Scripts

There are two ways to use pywikibot:

Write your own custom actions
Use a bundle of scripts that come packaged with pywikibot
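
For the first option, a custom action is just a Python function that reads and writes page objects. Below is a minimal sketch, assuming only that a page object exposes a text attribute and a save(summary=...) method, as pywikibot.Page does. The function name append_line and the page title in the usage comment are made up for illustration.

```python
# Sketch of a custom action: append a line to a page if it is not there yet.
# append_line is a hypothetical helper; it works on any object that has a
# ``text`` attribute and a ``save(summary=...)`` method.


def append_line(page, line, summary):
    """Append ``line`` to the page text; return True if the page was saved."""
    if line in page.text:
        # Nothing to do, so avoid making an empty edit.
        return False
    page.text = page.text.rstrip("\n") + "\n" + line + "\n"
    page.save(summary=summary)
    return True


# Usage against a live wiki (needs pywikibot and a working user-config.py):
#   import pywikibot
#   site = pywikibot.Site()
#   page = pywikibot.Page(site, "Sandbox")   # hypothetical page title
#   append_line(page, "Edited by a bot.", summary="Bot test edit")
```

Keeping the action separate from the Site and Page setup makes it easy to try out against a stand-in object before pointing it at the real wiki.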

Prewritten Scripts

Here is a list of all the pre-written scripts for MediaWiki wikis: https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Pywikibot/Scripts

These are also located in the scripts/ folder of the repository.

To run a script, do not call it directly; invoke it through the pwb.py wrapper script, as in the example below.

Redirect.py Script

Suppose we wanted to run the script redirect.py to programmatically deal with redirects on our wiki. We can start by looking at the documentation for this file, which shows us there are many options for this script:

Script to resolve double redirects, and to delete broken redirects.

Requires access to MediaWiki's maintenance pages or to a XML dump file.
Delete function requires adminship.

Syntax:

    python pwb.py redirect action [-arguments ...]

where action can be one of these:

double         Fix redirects which point to other redirects.
do             Shortcut action command is "do".

broken         Tries to fix redirect which point to nowhere by using the last
br             moved target of the destination page. If this fails and the
               -delete option is set, it either deletes the page or marks it
               for deletion depending on whether the account has admin rights.
               It will mark the redirect not for deletion if there is no speedy
               deletion template available. Shortcut action command is "br".

both           Both of the above. Retrieves redirect pages from live wiki,
               not from a special page.

and arguments can be:

-xml           Retrieve information from a local XML dump
               (https://download.wikimedia.org). Argument can also be given as
               "-xml:filename.xml". Cannot be used with -fullscan or -moves.

-fullscan      Retrieve redirect pages from live wiki, not from a special page
               Cannot be used with -xml.

-moves         Use the page move log to find double-redirect candidates. Only
               works with action "double", does not work with -xml.

               NOTE: You may use only one of these options above.
               If neither of -xml -fullscan -moves is given, info will be
               loaded from a special page of the live wiki.

-page:title    Work on a single page

-namespace:n   Namespace to process. Can be given multiple times, for several
               namespaces. If omitted, only the main (article) namespace is
               treated.

-offset:n      With -moves, the number of hours ago to start scanning moved
               pages. With -xml, the number of the redirect to restart with
               (see progress). Otherwise, ignored.

-start:title   The starting page title in each namespace. Page need not exist.

-until:title   The possible last page title in each namespace. Page need not
               exist.

-total:n       The maximum count of redirects to work upon. If omitted, there
               is no limit.

-delete        Prompt the user whether broken redirects should be deleted (or
               marked for deletion if the account has no admin rights) instead
               of just skipping them.

-sdtemplate:x  Add the speedy deletion template string including brackets.
               This enables overriding the default template via i18n or
               to enable speedy deletion for projects other than wikipedias.

-always        Don't prompt you for each replacement.

Suppose we want to eliminate double-redirects. To do this, we run the redirect script through pwb.py, and pass it the double argument like so:

$ python pwb.py redirect double
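
The help text above has a few rules that are easy to trip over: only one of -xml, -fullscan, and -moves may be given, and -moves only works with the double action. The sketch below checks those rules before building the command line. It is my own helper, not part of pywikibot, and the name build_redirect_command is hypothetical.

```python
import sys

# Options that choose where redirects are read from; at most one is allowed.
SOURCE_OPTIONS = ("-xml", "-fullscan", "-moves")
# Actions listed in the help text, including the "do" and "br" shortcuts.
ACTIONS = ("double", "do", "broken", "br", "both")


def option_name(arg):
    """Return the option name without any ':value' suffix."""
    return arg.split(":", 1)[0]


def build_redirect_command(action, *args):
    """Return an argv list for 'python pwb.py redirect ...'.

    Raises ValueError if the arguments break the rules in the help text.
    """
    if action not in ACTIONS:
        raise ValueError("unknown action: %s" % action)
    sources = {option_name(a) for a in args} & set(SOURCE_OPTIONS)
    if len(sources) > 1:
        raise ValueError("use only one of %s" % ", ".join(SOURCE_OPTIONS))
    if "-moves" in sources and action not in ("double", "do"):
        raise ValueError("-moves only works with action 'double'")
    return [sys.executable, "pwb.py", "redirect", action] + list(args)


if __name__ == "__main__":
    print(build_redirect_command("double", "-moves", "-total:10"))
```

The returned list can be handed to subprocess.call() from inside the checkout directory, which is where pwb.py lives.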

Advanced Usage

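The History class below, which appears to come from pywikibot's weblinkchecker script, stores previously found dead links. The imports are the ones the excerpt itself needs; get_archive_url and weblib are defined or imported elsewhere in the original script.

import codecs
import pickle
import threading
import time

import pywikibot
from pywikibot import config

class History(object):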

    """
    Store previously found dead links.

    The URLs are dictionary keys, and
    values are lists of tuples where each tuple represents one time the URL was
    found dead. Tuples have the form (title, date, error) where title is the
    wiki page where the URL was found, date is a timestamp in seconds as
    returned by time.time(), and error is a string with error code and message.

    We assume that the first element in the list represents the first time we
    found this dead link, and the last element represents the last time.

    Example:

    historyDict = {
        'https://www.example.org/page': [
            ('WikiPageTitle', DATE, '404: File not found'),
            ('WikiPageName2', DATE, '404: File not found'),
        ]
    }

    """

    def __init__(self, reportThread, site=None):
        """Constructor."""
        self.reportThread = reportThread
        if not site:
            self.site = pywikibot.Site()
        else:
            self.site = site
        self.semaphore = threading.Semaphore()
        self.datfilename = pywikibot.config.datafilepath(
            'deadlinks', 'deadlinks-%s-%s.dat' % (self.site.family.name,
                                                  self.site.code))
        # Count the number of logged links, so that we can insert captions
        # from time to time
        self.logCount = 0
        try:
            with open(self.datfilename, 'rb') as datfile:
                self.historyDict = pickle.load(datfile)
        except (IOError, EOFError):
            # no saved history exists yet, or history dump broken
            self.historyDict = {}

    def log(self, url, error, containingPage, archiveURL):
        """Log an error report to a text file in the deadlinks subdirectory."""
        if archiveURL:
            errorReport = u'* %s ([%s archive])\n' % (url, archiveURL)
        else:
            errorReport = u'* %s\n' % url
        # Use a separate loop name so the ``error`` argument is not shadowed.
        for (pageTitle, date, pastError) in self.historyDict[url]:
            # ISO 8601-style UTC timestamp
            isoDate = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(date))
            errorReport += "** In [[%s]] on %s, %s\n" % (pageTitle, isoDate,
                                                         pastError)
        pywikibot.output(u"** Logging link for deletion.")
        txtfilename = pywikibot.config.datafilepath('deadlinks',
                                                    'results-%s-%s.txt'
                                                    % (self.site.family.name,
                                                       self.site.lang))
        self.logCount += 1
        # The with block closes the file even if a write fails.
        with codecs.open(txtfilename, 'a', 'utf-8') as txtfile:
            if self.logCount % 30 == 0:
                # insert a caption
                txtfile.write('=== %s ===\n' % containingPage.title()[:3])
            txtfile.write(errorReport)

        if self.reportThread and not containingPage.isTalkPage():
            self.reportThread.report(url, errorReport, containingPage,
                                     archiveURL)

    def setLinkDead(self, url, error, page, weblink_dead_days):
        """Add the fact that the link was found dead to the .dat file."""
        now = time.time()
        # Hold the lock for the whole update; the with block releases it
        # even if an exception is raised.
        with self.semaphore:
            if url in self.historyDict:
                timeSinceFirstFound = now - self.historyDict[url][0][1]
                timeSinceLastFound = now - self.historyDict[url][-1][1]
                # If we last found this dead link less than an hour ago,
                # don't add another history entry this time.
                if timeSinceLastFound > 60 * 60:
                    self.historyDict[url].append((page.title(), now, error))
                # If we first found this link dead more than
                # weblink_dead_days days ago (default is a week), it should
                # probably be fixed or removed. We'll list it in a file so
                # that it can be removed manually.
                if timeSinceFirstFound > 60 * 60 * 24 * weblink_dead_days:
                    # search for archived page
                    try:
                        archiveURL = get_archive_url(url)
                    except Exception as e:
                        pywikibot.warning(
                            'get_archive_url({0}) failed: {1}'.format(
                                url, e))
                        archiveURL = None
                    if archiveURL is None:
                        archiveURL = weblib.getInternetArchiveURL(url)
                    if archiveURL is None:
                        archiveURL = weblib.getWebCitationURL(url)
                    self.log(url, error, page, archiveURL)
            else:
                self.historyDict[url] = [(page.title(), now, error)]

    def setLinkAlive(self, url):
        """
        Record that the link is now alive.

        If link was previously found dead, remove it from the .dat file.

        @return: True if previously found dead, else returns False.
        """
        # Check and delete under the lock so that another thread cannot
        # remove the entry between the two steps.
        with self.semaphore:
            if url in self.historyDict:
                del self.historyDict[url]
                return True
        return False

    def save(self):
        """Save the .dat file to disk."""
        with open(self.datfilename, 'wb') as f:
            pickle.dump(self.historyDict, f, protocol=config.pickle_protocol)
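
To make the bookkeeping in setLinkDead easier to follow, here is a stripped-down sketch of the same idea using a plain dictionary, with no pywikibot, locking, or pickling. The function name record_dead and its parameters are my own. The one-hour and one-week thresholds come from the comments in the class above.

```python
import time

HOUR = 60 * 60
DAY = 24 * HOUR


def record_dead(history, url, title, error, now, dead_days=7):
    """Record one sighting of a dead URL.

    ``history`` maps each URL to a list of (title, timestamp, error) tuples,
    oldest first. Returns True when the URL has been dead for more than
    ``dead_days`` days and should be reported.
    """
    if url not in history:
        # First sighting: start the list, nothing to report yet.
        history[url] = [(title, now, error)]
        return False
    first_seen = history[url][0][1]
    last_seen = history[url][-1][1]
    # Skip sightings less than an hour apart to keep the list short.
    if now - last_seen > HOUR:
        history[url].append((title, now, error))
    return now - first_seen > dead_days * DAY


if __name__ == "__main__":
    history = {}
    start = time.time()
    url = "https://www.example.org/page"
    record_dead(history, url, "WikiPageTitle", "404: File not found", start)
    due = record_dead(history, url, "WikiPageName2", "404: File not found",
                      start + 8 * DAY)
    print(due, len(history[url]))  # prints: True 2
```

Passing the current time in as an argument, instead of calling time.time() inside the function, makes it possible to simulate a week passing without waiting for it.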

Flags