From charlesreid1

Site Objects

Once you have installed Pywikibot (see Pywikibot/Installing) and set up Pywikibot to run scripts (see Pywikibot/Setup), you should be ready to create a Site object to represent the MediaWiki site you are scraping.

import pywikibot
s = pywikibot.Site()

Useful Actions

Get all pages

To get all pages:

>>> s.allpages()
>>> help(s.allpages)
Help on method allpages in module pywikibot.site:

allpages(start='!', prefix='', namespace=0, filterredir=None, filterlanglinks=None, minsize=None, maxsize=None, protect_type=None, protect_level=None, reverse=False, total=None, content=False, throttle=NotImplemented, limit='[deprecated name of total]', step=NotImplemented, includeredirects='[deprecated name of filterredir]') method of pywikibot.site.APISite instance
    Iterate pages in a single namespace.

This will return a PageGenerator object.
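Because the result is a generator, nothing is fetched until you iterate over it. The following is a minimal sketch of pulling only the first few page titles. The helper name first_titles is hypothetical. It assumes the site object accepts the total keyword shown in the help output above, and that each yielded page has a title() method.

```python
from itertools import islice


def first_titles(site, limit=5):
    """Return the titles of up to `limit` pages from the default namespace."""
    # total=limit asks the generator to stop after `limit` pages;
    # islice is a second guard in case a site-like object ignores `total`.
    pages = site.allpages(total=limit)
    return [page.title() for page in islice(pages, limit)]
```

With the Site object created earlier, first_titles(s, limit=3) would be expected to return a list of at most three title strings.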

Get all categories

To get all categories (a category is just a page in the category namespace, so this still returns a PageGenerator object):

>>> s.allcategories()
>>> help(s.allcategories)
Help on method allcategories in module pywikibot.site:

allcategories(start='!', prefix='', total=None, reverse=False, content=False, step=NotImplemented) method of pywikibot.site.APISite instance
    Iterate categories used (which need not have a Category page).

    Iterator yields Category objects. Note that, in practice, links that
    were found on pages that have been deleted may not have been removed
    from the database table, so this method can return false positives.

    @param start: Start at this category title (category need not exist).
    @param prefix: Only yield categories starting with this string.
    @param reverse: if True, iterate in reverse Unicode lexigraphic
        order (default: iterate in forward order)
    @param content: if True, load the current content of each iterated page
        (default False); note that this means the contents of the category
        description page, not the pages that are members of the category

The allcategories method returns a generator, which yields categories only as you iterate over it. To collect every category into a list at once, wrap the call to allcategories() in a call to list():

>>> allcats = list(s.allcategories())
>>> print(allcats[0])
Category('Category:ADT')
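If you only need some of the categories, the prefix parameter listed in the help output can narrow the results before they are collected. This is a sketch under that assumption. The helper name category_titles is hypothetical, and the prefix is assumed to match the category name without the "Category:" namespace.

```python
def category_titles(site, prefix=""):
    """Collect category titles, optionally restricted to a name prefix."""
    # The generator is consumed lazily here; no intermediate list of
    # Category objects is kept, only their title strings.
    return [cat.title() for cat in site.allcategories(prefix=prefix)]
```

For example, category_titles(s, prefix="A") would be expected to include 'Category:ADT' on the wiki used above.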

Get pages in a category

To get all pages in a given category, you need a reference to the category page. Here we start by getting one category page reference:

>>> cat = list(s.allcategories())[0]
>>> print(cat)
Category('Category:ADT')

>>> s.categorymembers(cat)
<pywikibot.data.api.PageGenerator at 0x105705128>

>>> list(s.categorymembers(cat))
[Page('Binary Search Trees/ADT'),
 Page('Binary Trees/ADT'),
 Page('Graphs/ADT'),
 Page('Java/Binary Search Trees'),
 Page('OOP Checklist'),
 Page('Trees/ADT')]
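Combining allcategories and categorymembers gives a simple index of the whole wiki. The sketch below is one way to build it. The helper name category_index is hypothetical, and it assumes both Category and Page objects expose title(). On a large wiki this makes one request per category, so it may be slow.

```python
def category_index(site):
    """Map each category title to the list of its member page titles."""
    index = {}
    for cat in site.allcategories():
        # categorymembers returns a generator; the list comprehension
        # consumes it fully for this one category.
        index[cat.title()] = [page.title() for page in site.categorymembers(cat)]
    return index
```

With the Site object from above, category_index(s)['Category:ADT'] would be expected to list the six pages shown in the output above.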

To get uncategorized pages:

>>> help(s.uncategorizedpages)
Help on method uncategorizedpages in module pywikibot.site:

uncategorizedpages(total=None, number='[deprecated name of total]', step=NotImplemented, repeat=NotImplemented) method of pywikibot.site.APISite instance
    Yield Pages from Special:Uncategorizedpages.

    @param total: number of pages to return

As before, this method returns a generator, but you can wrap this in a call to list to return all uncategorized pages at once:

>>> uncats = list(s.uncategorizedpages())

Get categories of a page

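To get the categories that a given page belongs to, pass a Page object to pagecategories (the page argument is required):

>>> page = pywikibot.Page(s, 'OOP Checklist')
>>> s.pagecategories(page)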
>>> help(s.pagecategories)
Help on method pagecategories in module pywikibot.site:

pagecategories(page, total=None, content=False, withSortKey=NotImplemented, step=NotImplemented) method of pywikibot.site.APISite instance
    Iterate categories to which page belongs.

    @param content: if True, load the current content of each iterated page
        (default False); note that this means the contents of the
        category description page, not the pages contained in the category
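As with the other methods, pagecategories yields objects lazily, so a small wrapper can turn the result into plain strings. This is a sketch only. The helper names categories_of and shares_category are hypothetical, and both assume that the yielded Category objects expose title().

```python
def categories_of(site, page):
    """Return the titles of the categories that `page` belongs to."""
    return [cat.title() for cat in site.pagecategories(page)]


def shares_category(site, page_a, page_b):
    """Return True if the two pages have at least one category in common."""
    common = set(categories_of(site, page_a)) & set(categories_of(site, page_b))
    return bool(common)
```

The page arguments are Page objects, for example pywikibot.Page(s, 'OOP Checklist').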

All Available Methods

$ ipython
Python 3.6.4 (default, Jan 12 2018, 05:16:29)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pywikibot

In [2]: s = pywikibot.Site()

In [3]: dir(s)
Out[3]:
['OnErrorExc',
 '_BaseSite__code',
 '_BaseSite__family',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_namespaces',
 '_cache_proofreadinfo',
 '_cmpkey',
 '_dl_errors',
 '_ep_errors',
 '_ep_text_overrides',
 '_generator',
 '_get_titles_with_hash',
 '_interwiki_urls',
 '_interwikimap',
 '_locked_pages',
 '_loginstatus',
 '_mh_errors',
 '_msgcache',
 '_mv_errors',
 '_pagemutex',
 '_paraminfo',
 '_patrol_errors',
 '_protect_errors',
 '_rb_errors',
 '_relogin',
 '_request',
 '_request_class',
 '_simple_request',
 '_siteinfo',
 '_update_page',
 '_username',
 'allcategories',
 'allimages',
 'alllinks',
 'allpages',
 'allusers',
 'ancientpages',
 'article_path',
 'assert_valid_iter_params',
 'blocks',
 'blockuser',
 'botusers',
 'broken_redirects',
 'case',
 'categories',
 'category_namespace',
 'category_namespaces',
 'category_on_one_line',
 'categoryinfo',
 'categorymembers',
 'checkBlocks',
 'checkCharset',
 'code',
 'compare',
 'cookies',
 'create_new_topic',
 'data_repository',
 'dbName',
 'deadendpages',
 'delete_post',
 'delete_topic',
 'deletedrevs',
 'deletepage',
 'disambcategory',
 'doc_subpage',
 'double_redirects',
 'editpage',
 'expand_text',
 'exturlusage',
 'fam',
 'family',
 'forceLogin',
 'fromDBName',
 'getExpandedString',
 'getFilesFromAnHash',
 'getImagesFromAnHash',
 'getNamespaceIndex',
 'getParsedString',
 'getPatrolToken',
 'getSite',
 'getToken',
 'getUrl',
 'get_parsed_page',
 'get_property_names',
 'get_searched_namespaces',
 'get_tokens',
 'getcategoryinfo',
 'getcurrenttime',
 'getcurrenttimestamp',
 'getglobaluserinfo',
 'getmagicwords',
 'getredirtarget',
 'getuserinfo',
 'globalusage',
 'globaluserinfo',
 'hasExtension',
 'has_all_mediawiki_messages',
 'has_api',
 'has_data_repository',
 'has_extension',
 'has_group',
 'has_image_repository',
 'has_mediawiki_message',
 'has_right',
 'has_transcluded_data',
 'hide_post',
 'hide_topic',
 'image_namespace',
 'image_repository',
 'imageusage',
 'interwiki',
 'interwiki_prefix',
 'interwiki_putfirst',
 'isAllowed',
 'isBlocked',
 'isBot',
 'isInterwikiLink',
 'is_blocked',
 'is_data_repository',
 'is_image_repository',
 'is_oauth_token_available',
 'is_uploaddisabled',
 'lang',
 'language',
 'languages',
 'linksearch',
 'linkto',
 'list_to_text',
 'live_version',
 'load_board',
 'load_pages_from_pageids',
 'load_post_current_revision',
 'load_topic',
 'load_topiclist',
 'loadcoordinfo',
 'loadflowinfo',
 'loadimageinfo',
 'loadpageimage',
 'loadpageinfo',
 'loadpageprops',
 'loadrevisions',
 'local_interwiki',
 'lock_page',
 'lock_topic',
 'logevents',
 'loggedInAs',
 'logged_in',
 'login',
 'logout',
 'logpages',
 'lonelypages',
 'longpages',
 'mediawiki_message',
 'mediawiki_messages',
 'mediawiki_namespace',
 'merge_history',
 'messages',
 'moderate_post',
 'moderate_topic',
 'months_names',
 'movepage',
 'namespace',
 'namespaces',
 'newfiles',
 'newimages',
 'newpages',
 'nice_get_address',
 'nocapitalize',
 'normalizeNamespace',
 'notifications',
 'notifications_mark_read',
 'ns_index',
 'ns_normalize',
 'obsolete',
 'page_can_be_edited',
 'page_embeddedin',
 'page_exists',
 'page_extlinks',
 'page_from_repository',
 'page_isredirect',
 'page_restrictions',
 'pagebacklinks',
 'pagecategories',
 'pageimages',
 'pagelanglinks',
 'pagelinks',
 'pagename2codes',
 'pagenamecodes',
 'pagereferences',
 'pages_with_property',
 'pagetemplates',
 'patrol',
 'postData',
 'postForm',
 'prefixindex',
 'preloadpages',
 'proofread_index_ns',
 'proofread_levels',
 'proofread_page_ns',
 'protect',
 'protectedpages',
 'protection_levels',
 'protection_types',
 'purgepages',
 'randompage',
 'randompages',
 'randomredirectpage',
 'recentchanges',
 'redirect',
 'redirectRegex',
 'redirectpages',
 'reply_to_post',
 'resolvemagicwords',
 'restore_post',
 'restore_topic',
 'rollbackpage',
 'sametitle',
 'search',
 'server_time',
 'shortpages',
 'siteinfo',
 'sitename',
 'solveCaptcha',
 'special_namespace',
 'stash_info',
 'suppress_post',
 'suppress_topic',
 'template_namespace',
 'thank_post',
 'thank_revision',
 'throttle',
 'token',
 'tokens',
 'unblockuser',
 'uncategorizedcategories',
 'uncategorizedfiles',
 'uncategorizedimages',
 'uncategorizedpages',
 'uncategorizedtemplates',
 'unconnected_pages',
 'undelete_page',
 'unlock_page',
 'unusedcategories',
 'unusedfiles',
 'unusedimages',
 'unwatchedpages',
 'updateCookies',
 'upload',
 'urlEncode',
 'use_hard_category_redirects',
 'user',
 'usercontribs',
 'userinfo',
 'username',
 'users',
 'validLanguageLinks',
 'validate_tokens',
 'version',
 'wantedcategories',
 'wantedpages',
 'watch',
 'watched_pages',
 'watchlist_revs',
 'watchpage',
 'withoutinterwiki']
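The full dir() listing is long and mixes private names, plain attributes, and methods. A short filter makes it easier to browse. The sketch below uses a hypothetical helper named public_methods. It inspects the object's class rather than the instance, so that properties are not evaluated as a side effect (an assumption being that on a Site object some of them would contact the wiki).

```python
def public_methods(obj):
    """Return sorted names of public callables defined on obj's class."""
    cls = type(obj)
    return sorted(
        name
        for name in dir(cls)
        # Looking the name up on the class returns property objects as-is
        # (not callable), so properties are skipped without being run.
        if not name.startswith("_") and callable(getattr(cls, name))
    )
```

For example, [n for n in public_methods(s) if 'categor' in n] would narrow the listing to the category-related methods.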

Flags