Pywikibot/Sites
From charlesreid1
Site Objects
Once you have installed Pywikibot (see Pywikibot/Installing) and set up Pywikibot to run scripts (see Pywikibot/Setup), you should be ready to create a Site object to represent the MediaWiki site you are scraping.
import pywikibot
s = pywikibot.Site()
Useful Actions
Get all pages
To get all pages:
>>> s.allpages()
>>> help(s.allpages)
Help on method allpages in module pywikibot.site:

allpages(start='!', prefix='', namespace=0, filterredir=None, filterlanglinks=None, minsize=None, maxsize=None, protect_type=None, protect_level=None, reverse=False, total=None, content=False, throttle=NotImplemented, limit='[deprecated name of total]', step=NotImplemented, includeredirects='[deprecated name of filterredir]') method of pywikibot.site.APISite instance
    Iterate pages in a single namespace.
This will return a PageGenerator object.
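Because the generator is lazy, nothing is fetched until you iterate over it. The following is a minimal sketch, assuming `site` is a Site object like `s` above and that each yielded page exposes a `title()` method. The helper name `first_titles` is hypothetical; `namespace` and `total` are the parameters shown in the help text above.

```python
def first_titles(site, total=10, namespace=0):
    """Return the titles of up to `total` pages from one namespace."""
    # allpages() yields page objects lazily; `total` caps how many are returned.
    return [page.title() for page in site.allpages(namespace=namespace, total=total)]


# Usage (requires a configured Pywikibot installation):
#   import pywikibot
#   s = pywikibot.Site()
#   print(first_titles(s, total=5))
```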
Get all categories
To get all categories (a category is just a page in the category namespace, so this still returns a PageGenerator object):
>>> s.allcategories()
>>> help(s.allcategories)
Help on method allcategories in module pywikibot.site:

allcategories(start='!', prefix='', total=None, reverse=False, content=False, step=NotImplemented) method of pywikibot.site.APISite instance
    Iterate categories used (which need not have a Category page).

    Iterator yields Category objects. Note that, in practice, links that
    were found on pages that have been deleted may not have been removed
    from the database table, so this method can return false positives.

    @param start: Start at this category title (category need not exist).
    @param prefix: Only yield categories starting with this string.
    @param reverse: if True, iterate in reverse Unicode lexigraphic
        order (default: iterate in forward order)
    @param content: if True, load the current content of each iterated page
        (default False); note that this means the contents of the category
        description page, not the pages that are members of the category
The allcategories method returns a generator, which fetches categories only as you iterate over it. To collect every category into a list in one step, wrap the call to allcategories() in a call to list():
>>> allcats = list(s.allcategories())
>>> print(allcats[0])
Category('Category:ADT')
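If you only need the first few items, building the full list is unnecessary. The following sketch uses the standard library's `itertools.islice` to stop early; the helper name `take` is hypothetical and works on any iterator, not only Pywikibot generators.

```python
from itertools import islice


def take(generator, n):
    """Return the first n items of a generator without consuming the rest."""
    # islice stops after n items, so later items are never requested.
    return list(islice(generator, n))


# Usage (requires a configured Pywikibot installation):
#   first_five = take(s.allcategories(), 5)
```

On a wiki with many categories this may avoid a long wait, since iteration stops as soon as `n` items have been produced.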
Get pages in a category
To get all pages in a given category, you need a reference to the category page. Here we start by getting one category page reference:
>>> cat = list(s.allcategories())[0]
>>> print(cat)
Category('Category:ADT')
>>> s.categorymembers(cat)
<pywikibot.data.api.PageGenerator at 0x105705128>
>>> list(s.categorymembers(cat))
[Page('Binary Search Trees/ADT'),
 Page('Binary Trees/ADT'),
 Page('Graphs/ADT'),
 Page('Java/Binary Search Trees'),
 Page('OOP Checklist'),
 Page('Trees/ADT')]
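Combining allcategories() with categorymembers() gives a simple index of the whole wiki. This is a rough sketch, assuming `site` is a Site object like `s` above and that both Category and Page objects expose `title()`; the helper name `category_index` is hypothetical.

```python
def category_index(site, total=None):
    """Map each category title to the titles of its member pages."""
    index = {}
    # `total` limits how many categories are visited (None means all of them).
    for cat in site.allcategories(total=total):
        index[cat.title()] = [page.title() for page in site.categorymembers(cat)]
    return index


# Usage (requires a configured Pywikibot installation):
#   index = category_index(s, total=3)
#   for name, members in index.items():
#       print(name, len(members))
```

Note that this issues one categorymembers() request per category, so passing a small `total` is a sensible way to try it out first.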
To get uncategorized pages:
>>> help(s.uncategorizedpages)
Help on method uncategorizedpages in module pywikibot.site:

uncategorizedpages(total=None, number='[deprecated name of total]', step=NotImplemented, repeat=NotImplemented) method of pywikibot.site.APISite instance
    Yield Pages from Special:Uncategorizedpages.

    @param total: number of pages to return
As before, this method returns a generator, but you can wrap this in a call to list to return all uncategorized pages at once:
>>> uncats = list(s.uncategorizedpages())
Get categories of a page
To get the categories on a given page:
>>> page = pywikibot.Page(s, 'OOP Checklist')
>>> s.pagecategories(page)
>>> help(s.pagecategories)
Help on method pagecategories in module pywikibot.site:

pagecategories(page, total=None, content=False, withSortKey=NotImplemented, step=NotImplemented) method of pywikibot.site.APISite instance
    Iterate categories to which page belongs.

    @param content: if True, load the current content of each iterated page
        (default False); note that this means the contents of the
        category description page, not the pages contained in the category
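A common use of pagecategories() is checking whether a page belongs to a particular category. The following is a minimal sketch, assuming `site` is a Site object like `s` above, `page` is a Page object, and Category objects expose `title()` returning the full title including the `Category:` prefix. The helper name `in_category` is hypothetical.

```python
def in_category(site, page, category_title):
    """Return True if `page` belongs to the category with the given full title."""
    # any() stops at the first match, so the generator is not fully consumed.
    return any(cat.title() == category_title for cat in site.pagecategories(page))


# Usage (requires a configured Pywikibot installation):
#   page = pywikibot.Page(s, 'OOP Checklist')
#   print(in_category(s, page, 'Category:ADT'))
```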
All Available Methods
$ ipython
Python 3.6.4 (default, Jan 12 2018, 05:16:29)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pywikibot

In [2]: s = pywikibot.Site()

In [3]: dir(s)
Out[3]: ['OnErrorExc', '_BaseSite__code', '_BaseSite__family', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build_namespaces', '_cache_proofreadinfo', '_cmpkey', '_dl_errors', '_ep_errors', '_ep_text_overrides', '_generator', '_get_titles_with_hash', '_interwiki_urls', '_interwikimap', '_locked_pages', '_loginstatus', '_mh_errors', '_msgcache', '_mv_errors', '_pagemutex', '_paraminfo', '_patrol_errors', '_protect_errors', '_rb_errors', '_relogin', '_request', '_request_class', '_simple_request', '_siteinfo', '_update_page', '_username', 'allcategories', 'allimages', 'alllinks', 'allpages', 'allusers', 'ancientpages', 'article_path', 'assert_valid_iter_params', 'blocks', 'blockuser', 'botusers', 'broken_redirects', 'case', 'categories', 'category_namespace', 'category_namespaces', 'category_on_one_line', 'categoryinfo', 'categorymembers', 'checkBlocks', 'checkCharset', 'code', 'compare', 'cookies', 'create_new_topic', 'data_repository', 'dbName', 'deadendpages', 'delete_post', 'delete_topic', 'deletedrevs', 'deletepage', 'disambcategory', 'doc_subpage', 'double_redirects', 'editpage', 'expand_text', 'exturlusage', 'fam', 'family', 'forceLogin', 'fromDBName', 'getExpandedString', 'getFilesFromAnHash', 'getImagesFromAnHash', 'getNamespaceIndex', 'getParsedString', 'getPatrolToken', 'getSite', 'getToken', 'getUrl', 'get_parsed_page', 'get_property_names', 
'get_searched_namespaces', 'get_tokens', 'getcategoryinfo', 'getcurrenttime', 'getcurrenttimestamp', 'getglobaluserinfo', 'getmagicwords', 'getredirtarget', 'getuserinfo', 'globalusage', 'globaluserinfo', 'hasExtension', 'has_all_mediawiki_messages', 'has_api', 'has_data_repository', 'has_extension', 'has_group', 'has_image_repository', 'has_mediawiki_message', 'has_right', 'has_transcluded_data', 'hide_post', 'hide_topic', 'image_namespace', 'image_repository', 'imageusage', 'interwiki', 'interwiki_prefix', 'interwiki_putfirst', 'isAllowed', 'isBlocked', 'isBot', 'isInterwikiLink', 'is_blocked', 'is_data_repository', 'is_image_repository', 'is_oauth_token_available', 'is_uploaddisabled', 'lang', 'language', 'languages', 'linksearch', 'linkto', 'list_to_text', 'live_version', 'load_board', 'load_pages_from_pageids', 'load_post_current_revision', 'load_topic', 'load_topiclist', 'loadcoordinfo', 'loadflowinfo', 'loadimageinfo', 'loadpageimage', 'loadpageinfo', 'loadpageprops', 'loadrevisions', 'local_interwiki', 'lock_page', 'lock_topic', 'logevents', 'loggedInAs', 'logged_in', 'login', 'logout', 'logpages', 'lonelypages', 'longpages', 'mediawiki_message', 'mediawiki_messages', 'mediawiki_namespace', 'merge_history', 'messages', 'moderate_post', 'moderate_topic', 'months_names', 'movepage', 'namespace', 'namespaces', 'newfiles', 'newimages', 'newpages', 'nice_get_address', 'nocapitalize', 'normalizeNamespace', 'notifications', 'notifications_mark_read', 'ns_index', 'ns_normalize', 'obsolete', 'page_can_be_edited', 'page_embeddedin', 'page_exists', 'page_extlinks', 'page_from_repository', 'page_isredirect', 'page_restrictions', 'pagebacklinks', 'pagecategories', 'pageimages', 'pagelanglinks', 'pagelinks', 'pagename2codes', 'pagenamecodes', 'pagereferences', 'pages_with_property', 'pagetemplates', 'patrol', 'postData', 'postForm', 'prefixindex', 'preloadpages', 'proofread_index_ns', 'proofread_levels', 'proofread_page_ns', 'protect', 'protectedpages', 
'protection_levels', 'protection_types', 'purgepages', 'randompage', 'randompages', 'randomredirectpage', 'recentchanges', 'redirect', 'redirectRegex', 'redirectpages', 'reply_to_post', 'resolvemagicwords', 'restore_post', 'restore_topic', 'rollbackpage', 'sametitle', 'search', 'server_time', 'shortpages', 'siteinfo', 'sitename', 'solveCaptcha', 'special_namespace', 'stash_info', 'suppress_post', 'suppress_topic', 'template_namespace', 'thank_post', 'thank_revision', 'throttle', 'token', 'tokens', 'unblockuser', 'uncategorizedcategories', 'uncategorizedfiles', 'uncategorizedimages', 'uncategorizedpages', 'uncategorizedtemplates', 'unconnected_pages', 'undelete_page', 'unlock_page', 'unusedcategories', 'unusedfiles', 'unusedimages', 'unwatchedpages', 'updateCookies', 'upload', 'urlEncode', 'use_hard_category_redirects', 'user', 'usercontribs', 'userinfo', 'username', 'users', 'validLanguageLinks', 'validate_tokens', 'version', 'wantedcategories', 'wantedpages', 'watch', 'watched_pages', 'watchlist_revs', 'watchpage', 'withoutinterwiki']
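The full dir() listing is long, so it can help to filter it. The following sketch drops private names (those starting with an underscore) and optionally keeps only names with a given prefix; the helper name `public_names` is hypothetical and works on any Python object.

```python
def public_names(obj, prefix=""):
    """List attribute names of obj that are public and start with `prefix`."""
    # dir() returns names in sorted order; names starting with '_' are skipped.
    return [
        name
        for name in dir(obj)
        if not name.startswith("_") and name.startswith(prefix)
    ]


# Usage (requires a configured Pywikibot installation):
#   public_names(s, "all")
```

For the Site object listed above, `public_names(s, "all")` should pick out the `all...` family of methods (allcategories, allimages, alllinks, allpages, allusers), though the exact set depends on the installed Pywikibot version.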