Better directory iterator and faster os.walk(), now in the Python 3.5 stdlib (LADI project)
Go to file
Ben Hoyt 982e6ba60e Bump up version number with recent Solaris fix and other updates 2019-03-09 12:51:39 -05:00
test drop python < 3.4 support 2019-02-19 08:03:26 +01:00
.gitattributes
.gitignore gitignore pycharm files 2019-02-19 06:38:46 +01:00
.travis.yml macOS: test on py27+py36 (xc9.4), py27+py37 (xc10.1) 2019-02-21 10:54:19 +01:00
LICENSE.txt Forgot changes from last commit. Attempt as Linux version 2013-05-13 22:13:23 +12:00
MANIFEST.in Make MANIFEST.in more explicit, so that test/test_dir (with Unicode filenames and whatnot) isn't included in PyPI archive. Fixes #74. Thanks @andrask for the initial pull request (#72). 2017-02-20 11:55:36 -05:00
README.rst drop python < 3.4 support 2019-02-19 08:03:26 +01:00
_scandir.c fix grammar 2018-05-29 22:38:11 +03:00
appveyor.yml Add Python 3.7 support to Appveyor config 2018-08-02 12:12:57 -04:00
benchmark.py Make benchmark.py work again on Python 2 (avoid "yield from"). 2015-04-24 23:55:32 -04:00
build.cmd Try to fix Windows Python 3.4 build 2017-09-28 11:30:06 -04:00
osdefs.h A few fixes for Python 3.2 (and PyPy) based on https://github.com/benhoyt/scandir/pull/53 2016-01-02 18:44:43 -05:00
scandir.py Bump up version number with recent Solaris fix and other updates 2019-03-09 12:51:39 -05:00
setup.py travis-ci: run py37 tests on xenial, fixes #115 2019-02-20 05:29:44 +01:00
tox.ini remove tox-travis, use tox features 2019-02-21 10:26:48 +01:00
winreparse.h Add _scandir2.c implementation based on CPython 3.5 implementation -- only works on Python 3 and Windows for now, so don't add to setup.py yet. 2015-04-24 23:06:01 -04:00

README.rst

scandir, a better directory iterator and faster os.walk()
=========================================================

.. image:: https://img.shields.io/pypi/v/scandir.svg
   :target: https://pypi.python.org/pypi/scandir
   :alt: scandir on PyPI (Python Package Index)

.. image:: https://travis-ci.org/benhoyt/scandir.svg?branch=master
   :target: https://travis-ci.org/benhoyt/scandir
   :alt: Travis CI tests (Linux)

.. image:: https://ci.appveyor.com/api/projects/status/github/benhoyt/scandir?branch=master&svg=true
   :target: https://ci.appveyor.com/project/benhoyt/scandir
   :alt: Appveyor tests (Windows)


``scandir()`` is a directory iteration function like ``os.listdir()``,
except that instead of returning a list of bare filenames, it yields
``DirEntry`` objects that include file type and stat information along
with the name. Using ``scandir()`` increases the speed of ``os.walk()``
by 2-20 times (depending on the platform and file system) by avoiding
unnecessary calls to ``os.stat()`` in most cases.


Now included in a Python near you!
----------------------------------

``scandir`` has been included in the Python 3.5 standard library as
``os.scandir()``, and the related performance improvements to
``os.walk()`` have also been included. So if you're lucky enough to be
using Python 3.5 (release date September 13, 2015) you get the benefit
immediately, otherwise just
`download this module from PyPI <https://pypi.python.org/pypi/scandir>`_,
install it with ``pip install scandir``, and then do something like
this in your code:

.. code-block:: python

    # Use the built-in version of scandir/walk if possible, otherwise
    # use the scandir module version
    try:
        from os import scandir, walk
    except ImportError:
        from scandir import scandir, walk

`PEP 471 <https://www.python.org/dev/peps/pep-0471/>`_, which is the
PEP that proposes including ``scandir`` in the Python standard library,
was `accepted <https://mail.python.org/pipermail/python-dev/2014-July/135561.html>`_
in July 2014 by Victor Stinner, the BDFL-delegate for the PEP.

This ``scandir`` module is intended to work on Python 2.7+ and Python
3.4+ (and it has been tested on those versions).


Background
----------

Python's built-in ``os.walk()`` is significantly slower than it needs to be,
because -- in addition to calling ``listdir()`` on each directory -- it calls
``stat()`` on each file to determine whether the filename is a directory or not.
But both ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux/OS
X already tell you whether the files returned are directories or not, so
no further ``stat`` system calls are needed. In short, you can reduce the number
of system calls from about 2N to N, where N is the total number of files and
directories in the tree.

In practice, removing all those extra system calls makes ``os.walk()`` about
**7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS
X.** So we're not talking about micro-optimizations. See more benchmarks
in the "Benchmarks" section below.

Somewhat relatedly, many people have also asked for a version of
``os.listdir()`` that yields filenames as it iterates instead of returning them
as one big list. This improves memory efficiency for iterating very large
directories.

So as well as a faster ``walk()``, scandir adds a new ``scandir()`` function.
They're pretty easy to use, but see "The API" below for the full docs.


Benchmarks
----------

Below are results showing how many times as fast ``scandir.walk()`` is than
``os.walk()`` on various systems, found by running ``benchmark.py`` with no
arguments:

====================   ==============   =============
System version         Python version   Times as fast
====================   ==============   =============
Windows 7 64-bit       2.7.7 64-bit     10.4
Windows 7 64-bit SSD   2.7.7 64-bit     10.3
Windows 7 64-bit NFS   2.7.6 64-bit     36.8
Windows 7 64-bit SSD   3.4.1 64-bit     9.9
Windows 7 64-bit SSD   3.5.0 64-bit     9.5
Ubuntu 14.04 64-bit    2.7.6 64-bit     5.8
Mac OS X 10.9.3        2.7.5 64-bit     3.8
====================   ==============   =============

All of the above tests were done using the fast C version of scandir
(source code in ``_scandir.c``).

Note that the gains are less than the above on smaller directories and greater
on larger directories. This is why ``benchmark.py`` creates a test directory
tree with a standardized size.


The API
-------

walk()
~~~~~~

The API for ``scandir.walk()`` is exactly the same as ``os.walk()``, so just
`read the Python docs <https://docs.python.org/3.5/library/os.html#os.walk>`_.

scandir()
~~~~~~~~~

The full docs for ``scandir()`` and the ``DirEntry`` objects it yields are
available in the `Python documentation here <https://docs.python.org/3.5/library/os.html#os.scandir>`_. 
But below is a brief summary as well.

    scandir(path='.') -> iterator of DirEntry objects for given path

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the given
``path``, but it's different from ``listdir`` in two ways:

* Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the additional data the
  operating system may have returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and
sub-directory in ``path``. Just like ``listdir``, the ``'.'``
and ``'..'`` pseudo-directories are skipped, and the entries are
yielded in system-dependent order. Each ``DirEntry`` object has the
following attributes and methods:

* ``name``: the entry's filename, relative to the scandir ``path``
  argument (corresponds to the return values of ``os.listdir``)

* ``path``: the entry's full path name (not necessarily an absolute
  path) -- the equivalent of ``os.path.join(scandir_path, entry.name)``

* ``is_dir(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_dir()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases;
  don't follow symbolic links if ``follow_symlinks`` is False

* ``is_file(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_file()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases; 
  don't follow symbolic links if ``follow_symlinks`` is False

* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
  return value is cached on the ``DirEntry`` object; doesn't require a
  system call in most cases

* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
  return value is cached on the ``DirEntry`` object; does not require a
  system call on Windows (except for symlinks); don't follow symbolic links
  (like ``os.lstat()``) if ``follow_symlinks`` is False

* ``inode()``: return the inode number of the entry; the return value
  is cached on the ``DirEntry`` object

Here's a very simple example of ``scandir()`` showing use of the
``DirEntry.name`` attribute and the ``DirEntry.is_dir()`` method:

.. code-block:: python

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This ``subdirs()`` function will be significantly faster with scandir
than ``os.listdir()`` and ``os.path.isdir()`` on both Windows and POSIX
systems, especially on medium-sized or large directories.


Further reading
---------------

* `The Python docs for scandir <https://docs.python.org/3.5/library/os.html#os.scandir>`_
* `PEP 471 <https://www.python.org/dev/peps/pep-0471/>`_, the
  (now-accepted) Python Enhancement Proposal that proposed adding
  ``scandir`` to the standard library -- a lot of details here,
  including rejected ideas and previous discussion


Flames, comments, bug reports
-----------------------------

Please send flames, comments, and questions about scandir to Ben Hoyt:

http://benhoyt.com/

File bug reports for the version in the Python 3.5 standard library
`here <https://docs.python.org/3.5/bugs.html>`_, or file bug reports
or feature requests for this module at the GitHub project page:

https://github.com/benhoyt/scandir