NEP 42 — New and extensible DTypes#
- title:
New and extensible DTypes
- Author:
Sebastian Berg
- Author:
Ben Nathanson
- Author:
Marten van Kerkwijk
- Status:
Accepted
- Type:
Standard
- Created:
2019-07-17
- Resolution:
https://mail.python.org/pipermail/numpy-discussion/2020-October/081038.html
Note
This NEP is third in a series: NEP 40 explains the shortcomings of NumPy's existing dtype implementation; NEP 41 gives an overview of the proposed replacement; this NEP (NEP 42) describes the new design's datatype-related APIs; NEP 43 describes the new design's API for universal functions.
Abstract#
NumPy’s dtype architecture is monolithic – each dtype is an instance of a single class. There’s no principled way to expand it for new dtypes, and the code is difficult to read and maintain.
As NEP 41 explains, we are proposing a new architecture that is modular and open to user additions. dtypes will derive from a new DType class serving as the extension point for new types. np.dtype("float64") will return an instance of a Float64 class, a subclass of root class np.dtype.
This NEP is one of two that lay out the design and API of this new architecture. This NEP addresses dtype implementation; NEP 43 addresses universal functions.
Note
Details of the private and external APIs may change to reflect user comments and implementation constraints. The underlying principles and choices should not change significantly.
Motivation and scope#
Our goal is to allow user code to create fully featured dtypes for a broad variety of uses, from physical units (such as meters) to domain-specific representations of geometric objects. NEP 41 describes a number of these new dtypes and their benefits.
Any design supporting dtypes must consider:
How shape and dtype are determined when an array is created
How array elements are stored and accessed
The rules for casting dtypes to other dtypes
In addition:
We want dtypes to comprise a class hierarchy open to new types and to subhierarchies, as motivated in NEP 41.
And to provide this, we need to define a user API.
All these are the subjects of this NEP.
The class hierarchy, its relation to the Python scalar types, and its important attributes are described in The DType class.
The functionality that will support dtype casting is described in Casting.
The implementation of item access and storage, and the way shape and dtype are determined when creating an array, are described in Coercion to and from Python objects.
The functionality for users to define their own DTypes is described in Public C-API.
The API here and in NEP 43 is entirely on the C side. A Python-side version will be proposed in a future NEP. A future Python API is expected to be similar, but provide a more convenient API to reuse the functionality of existing DTypes. It could also provide shorthands to create structured DTypes similar to Python’s dataclasses.
Backward compatibility#
The disruption is expected to be no greater than that of a typical NumPy release.
The main issues are noted in NEP 41 and will mostly affect heavy users of the NumPy C-API.
Eventually we will want to deprecate the API currently used for creating user-defined dtypes.
Small, rarely noticed inconsistencies are likely to change. Examples:
- np.array(np.nan, dtype=np.int64) behaves differently from np.array([np.nan], dtype=np.int64), with the latter raising an error. This may require identical results (either both error or both succeed).
- np.array([array_like]) sometimes behaves differently from np.array([np.array(array_like)]).
- array operations may or may not preserve dtype metadata.
Documentation that describes the internal structure of dtypes will need to be updated.
The new code must pass NumPy’s regular test suite, giving some assurance that the changes are compatible with existing code.
Usage and impact#
We believe the few structures in this section are sufficient to consolidate NumPy’s present functionality and also to support complex user-defined DTypes.
The rest of the NEP fills in details and provides support for the claim.
Again, though Python is used for illustration, the implementation is a C API only; a future NEP will tackle the Python API.
After implementing this NEP, creating a DType will be possible by implementing the following outlined DType base class, which is further described in The DType class:
class DType(np.dtype):
    type : type        # Python scalar type
    parametric : bool  # (may be indicated by superclass)

    @property
    def canonical(self) -> bool:
        raise NotImplementedError

    def ensure_canonical(self : DType) -> DType:
        raise NotImplementedError
For casting, a large part of the functionality is provided by the "methods" stored in _castingimpl:

    @classmethod
    def common_dtype(cls : DTypeMeta, other : DTypeMeta) -> DTypeMeta:
        raise NotImplementedError

    def common_instance(self : DType, other : DType) -> DType:
        raise NotImplementedError

    # A mapping of "methods" each detailing how to cast to another DType
    # (further specified at the end of the section)
    _castingimpl = {}
For array-coercion, also part of casting:
    def __dtype_setitem__(self, item_pointer, value):
        raise NotImplementedError

    def __dtype_getitem__(self, item_pointer, base_obj) -> object:
        raise NotImplementedError

    @classmethod
    def __discover_descr_from_pyobject__(cls, obj : object) -> DType:
        raise NotImplementedError

    # initially private:
    @classmethod
    def _known_scalar_type(cls, obj : object) -> bool:
        raise NotImplementedError
Another element of the casting implementation is the CastingImpl:

casting = Union["safe", "same_kind", "unsafe"]

class CastingImpl:
    # Object describing and performing the cast
    casting : casting

    def resolve_descriptors(self, Tuple[DTypeMeta], Tuple[DType|None] : input) -> (casting, Tuple[DType]):
        raise NotImplementedError

    # initially private:
    def _get_loop(...) -> lowlevel_C_loop:
        raise NotImplementedError
which describes the casting from one DType to another. In NEP 43 this CastingImpl object is used unchanged to support universal functions. Note that the name CastingImpl here will be generically called ArrayMethod to accommodate both casting and universal functions.
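For illustration only, here is a minimal nonparametric user DType sketched against the outline above. All names (MeterFloat64, the MeterFloat scalar, and the item_pointer helpers) are invented for this sketch and are not part of the proposal:

class MeterFloat64(DType):
    type = MeterFloat   # hypothetical Python scalar type holding meters
    parametric = False

    @property
    def canonical(self) -> bool:
        # only one (native byte order) storage layout exists
        return True

    def ensure_canonical(self) -> "MeterFloat64":
        return self

    def __dtype_setitem__(self, item_pointer, value):
        # invented helper; the real API is C-side
        item_pointer.write_float64(float(value))

    def __dtype_getitem__(self, item_pointer, base_obj) -> object:
        return MeterFloat(item_pointer.read_float64())  # invented helper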
Definitions#
- dtype#
The dtype instance; this is the object attached to a numpy array.
- DType#
Any subclass of the base type np.dtype.
- coercion#
Conversion of Python types to NumPy arrays and values stored in a NumPy array.
- cast#
Conversion of an array to a different dtype.
- parametric type#
A dtype whose representation can change based on a parameter value, like a string dtype with a length parameter. All members of the current flexible dtype class are parametric. See NEP 40.
- promotion#
Finding a dtype that can perform an operation on a mix of dtypes without loss of information.
- safe cast#
A cast is safe if no information is lost when changing type.
On the C level we use descriptor or descr to mean dtype instance. In the proposed C-API, these terms will distinguish dtype instances from DType classes.
Note
NumPy has an existing class hierarchy for scalar types, as seen in the figure of NEP 40, and the new DType hierarchy will resemble it. The types are used as an attribute of the single dtype class in the current NumPy; they’re not dtype classes. They neither harm nor help this work.
The DType class#
This section reviews the structure underlying the proposed DType class, including the type hierarchy and the use of abstract DTypes.
Class getter#
To create a DType instance from a scalar type users now call np.dtype (for instance, np.dtype(np.int64)). Sometimes it is also necessary to access the underlying DType class; this comes up in particular with type hinting because the "type" of a DType instance is the DType class. Taking inspiration from type hinting, we propose the following getter syntax:

np.dtype[np.int64]

to get the DType class corresponding to a scalar type. The notation works equally well with built-in and user-defined DTypes.
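For example, once DType classes exist, the getter is intended to behave along these lines (a sketch of the intended semantics, not current behavior):

np.dtype[np.int64]                                  # the DType class for int64
isinstance(np.dtype("int64"), np.dtype[np.int64])   # intended: True
type(np.dtype("int64")) is np.dtype[np.int64]       # intended: True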
This getter eliminates the need to create an explicit name for every DType, crowding the np namespace; the getter itself signifies the type. It also opens the possibility of making np.ndarray generic over DType class using annotations like:

np.ndarray[np.dtype[np.float64]]

The above is fairly verbose, so it is possible that we will include aliases like:

Float64 = np.dtype[np.float64]

in numpy.typing, thus keeping annotations concise but still avoiding crowding the np namespace as discussed above. For a user-defined DType:

class UserDtype(dtype): ...

one can do np.ndarray[UserDtype], keeping annotations concise in that case without introducing boilerplate in NumPy itself. For a user-defined scalar type:

class UserScalar(generic): ...

we would need to add a typing overload to dtype:

@overload
__new__(cls, dtype: Type[UserScalar], ...) -> UserDtype

to allow np.dtype[UserScalar].
The initial implementation probably will return only concrete (not abstract) DTypes.
This item is still under review.
Hierarchy and abstract classes#
We will use abstract classes as building blocks of our extensible DType class hierarchy.
- Abstract classes are inherited cleanly, in principle allowing checks like isinstance(np.dtype("float64"), np.inexact).
- Abstract classes allow a single piece of code to handle a multiplicity of input types. Code written to accept Complex objects can work with numbers of any precision; the precision of the results is determined by the precision of the arguments.
- There's room for user-created families of DTypes. We can envision an abstract Unit class for physical units, with a concrete subclass like Float64Unit. Calling Unit(np.float64, "m") (m for meters) would be equivalent to Float64Unit("m").
- The implementation of universal functions in NEP 43 may require a class hierarchy.
Example: A NumPy Categorical class would be a match for pandas Categorical objects, which can contain integers or general Python objects. NumPy needs a DType that it can assign a Categorical to, but it also needs DTypes like CategoricalInt64 and CategoricalObject such that common_dtype(CategoricalInt64, String) raises an error, but common_dtype(CategoricalObject, String) returns an object DType. In our scheme, Categorical is an abstract type with CategoricalInt64 and CategoricalObject subclasses.
Rules for the class structure, illustrated below:
- Abstract DTypes cannot be instantiated. Instantiating an abstract DType raises an error, or perhaps returns an instance of a concrete subclass. Raising an error will be the default behavior and may be required initially.
- While abstract DTypes may be superclasses, they may also act like Python's abstract base classes (ABCs), allowing registration instead of subclassing. It may be possible to simply use or inherit from Python ABCs.
- Concrete DTypes may not be subclassed. In the future this might be relaxed to allow specialized implementations such as a GPU float64 subclassing a NumPy float64. The Julia language has a similar prohibition against subclassing concrete types.
For example, methods such as the later __common_instance__ or __common_dtype__ cannot work for a subclass unless they were designed very carefully. The prohibition helps avoid unintended vulnerabilities to implementation changes that result from subclassing types that were not written to be subclassed. We believe that the DType API should rather be extended to simplify wrapping of existing functionality.
The DType class requires C-side storage of methods and additional information, to be implemented by a DTypeMeta class. Each DType class is an instance of DTypeMeta with a well-defined and extensible interface; end users ignore it.
Miscellaneous methods and attributes#
This section collects definitions in the DType class that are not used in casting and array coercion, which are described in detail below.
- Existing dtype methods (numpy.dtype) and C-side fields are preserved.
- DType.type replaces dtype.type. Unless a use case arises, dtype.type will be deprecated. This indicates a Python scalar type which represents the same values as the DType. This is the same type as used in the proposed Class getter and for DType discovery during array coercion. (This may also be set for abstract DTypes; it is necessary for array coercion.)
- A new self.canonical property generalizes the notion of byte order to indicate whether data has been stored in a default/canonical way. For existing code, "canonical" will just signify native byte order, but it can take on new meanings in new DTypes – for instance, to distinguish a complex-conjugated instance of Complex which stores real - imag instead of real + imag. The ISNBO ("is native byte order") flag might be repurposed as the canonical flag.
- Support is included for parametric DTypes. A DType will be deemed parametric if it inherits from ParametricDType.
- DType methods may resemble or even reuse existing Python slots. Thus Python special slots are off-limits for user-defined DTypes (for instance, defining Unit("m") > Unit("cm")), since we may want to develop a meaning for these operators that is common to all DTypes.
- Sorting functions are moved to the DType class. They may be implemented by defining a method dtype_get_sort_function(self, sortkind="stable") -> sortfunction that must return NotImplemented if the given sortkind is not known.
- Functions that cannot be removed are implemented as special methods. Many of these were previously defined as part of the PyArray_ArrFuncs slot of the dtype instance (PyArray_Descr *) and include functions such as nonzero, fill (used for np.arange), and fromstr (used to parse text files). These old methods will be deprecated and replacements following the new design principles added. The API is not defined here. Since these methods can be deprecated and renamed replacements added, it is acceptable if these new methods have to be modified.
- Use of kind for non-built-in types is discouraged in favor of isinstance checks. kind will return the __qualname__ of the object to ensure uniqueness for all DTypes. On the C side, kind and char are set to \0 (the NULL character). While kind will be discouraged, the current np.issubdtype may remain the preferred method for this type of check.
- A method ensure_canonical(self) -> dtype returns a new dtype (or self) with the canonical flag set.
- Since NumPy's approach is to provide functionality through ufuncs, functions like sorting that will be implemented in DTypes might eventually be reimplemented as generalized ufuncs.
Casting#
We review here the operations related to casting arrays:
- Finding the “common dtype,” returned by numpy.promote_types() and numpy.result_type()
- The result of calling numpy.can_cast()

We show how casting arrays with astype(new_dtype) will be implemented.
Common DType operations#
When input types are mixed, a first step is to find a DType that can hold the result without loss of information – a “common DType.”
Array coercion and concatenation both return a common dtype instance. Most universal functions use the common DType for dispatching, though they might not use it for a result (for instance, the result of a comparison is always bool).
We propose the following implementation:
For two DType classes:

__common_dtype__(cls, other : DTypeMeta) -> DTypeMeta

Returns a new DType, often one of the inputs, which can represent values of both input DTypes. This should usually be minimal: the common DType of Int16 and Uint16 is Int32 and not Int64. __common_dtype__ may return NotImplemented to defer to other and, like Python operators, subclasses take precedence (their __common_dtype__ method is tried first).

For two instances of the same DType:

__common_instance__(self: SelfT, other : SelfT) -> SelfT

For nonparametric built-in dtypes, this returns a canonicalized copy of self, preserving metadata. For nonparametric user types, this provides a default implementation.

For instances of different DTypes, for example >float64 and S8, the operation is done in three steps:
1. Float64.__common_dtype__(type(>float64), type(S8)) returns String (or defers to String.__common_dtype__).
2. The casting machinery (explained in detail below) provides the information that ">float64" casts to "S32".
3. String.__common_instance__("S8", "S32") returns the final "S32".
The benefit of this handoff is to reduce duplicated code and keep concerns separate. DType implementations don’t need to know how to cast, and the results of casting can be extended to new types, such as a new string encoding.
This means the implementation will work like this:
def common_dtype(DType1, DType2):
    common_dtype = DType1.__common_dtype__(DType2)
    if common_dtype is NotImplemented:
        common_dtype = DType2.__common_dtype__(DType1)
        if common_dtype is NotImplemented:
            raise TypeError("no common dtype")
    return common_dtype
def promote_types(dtype1, dtype2):
    common = common_dtype(type(dtype1), type(dtype2))

    if type(dtype1) is not common:
        # Find what dtype1 is cast to when cast to the common DType
        # by using the CastingImpl as described below:
        castingimpl = get_castingimpl(type(dtype1), common)
        safety, (_, dtype1) = castingimpl.resolve_descriptors(
            (common, common), (dtype1, None))
        assert safety == "safe"  # promotion should normally be a safe cast

    if type(dtype2) is not common:
        ...  # same as the above branch for dtype1

    if dtype1 is not dtype2:
        return common.__common_instance__(dtype1, dtype2)
    return dtype1
Some of these steps may be optimized for nonparametric DTypes.
Since the type returned by __common_dtype__ is not necessarily one of the two arguments, it's not equivalent to NumPy's "safe" casting. Safe casting holds for np.promote_types(int16, int64), which returns int64, but not for:

np.promote_types("int64", "float32") -> np.dtype("float64")

since int64 cannot be cast to float64 safely.
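For illustration, the behavior of the current functions:

np.promote_types(np.int16, np.int64)    # dtype('int64'): also a safe cast
np.promote_types(np.int64, np.float32)  # dtype('float64'), even though
np.can_cast(np.int64, np.float64, casting="safe")  # is False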
It is the responsibility of the DType author to ensure that the inputs can be safely cast to the __common_dtype__. Exceptions may apply. For example, casting int32 to a (long enough) string is, at least at this time, considered "safe". However np.promote_types(int32, String) will not be defined.
Example: object always chooses object as the common DType. For datetime64, type promotion is defined with no other datatype, but if someone were to implement a new higher precision datetime, then:

HighPrecisionDatetime.__common_dtype__(np.dtype[np.datetime64])

would return HighPrecisionDatetime, and the casting implementation, as described below, may need to decide how to handle the datetime unit.
Alternatives:
- We're pushing the decision on common DTypes to the DType classes. Suppose instead we could turn to a universal algorithm based on safe casting, imposing a total order on DTypes and returning the first type that both arguments could cast to safely. It would be difficult to devise a reasonable total order, and it would have to accept new entries. Beyond that, the approach is flawed because importing a type can change the behavior of a program. For example, a program requiring the common DType of int16 and uint16 would ordinarily get the built-in type int32 as the first match; if the program adds import int24, the first match becomes int24 and the smaller type might make the program overflow for the first time. [1]
- A more flexible common DType could be implemented in the future where __common_dtype__ relies on information from the casting logic. Since __common_dtype__ is a method, such a default implementation could be added at a later time.
- The three-step handling of differing dtypes could, of course, be coalesced. It would lose the value of splitting in return for a possibly faster execution. But few cases would benefit. Most cases, such as array coercion, involve a single Python type (and thus dtype).
The cast operation#
Casting is perhaps the most complex and interesting DType operation. It is much like a typical universal function on arrays, converting one input to a new output, with two distinctions:
Casting always requires an explicit output datatype.
The NumPy iterator API requires access to functions that are lower-level than what universal functions currently need.
Casting can be complex and may not implement all details of each input datatype (such as non-native byte order or unaligned access). So a complex type conversion might entail 3 steps:
The input datatype is normalized and prepared for the cast.
The cast is performed.
The result, which is in a normalized form, is cast to the requested form (non-native byte order).
Further, NumPy provides different casting kinds or safety specifiers:
- equivalent, allowing only byte-order changes
- safe, requiring a type large enough to preserve values
- same_kind, requiring a safe cast or one within a kind, like float64 to float32
- unsafe, allowing any data conversion

and in some cases a cast may be just a view.
We need to support the two current signatures of arr.astype:
- For DTypes: arr.astype(np.String)
  - the current spelling is arr.astype("S")
  - np.String can be an abstract DType
- For dtypes: arr.astype(np.dtype("S8"))

We also have two signatures of np.can_cast:
- Instance to class: np.can_cast(dtype, DType, "safe")
- Instance to instance: np.can_cast(dtype, other_dtype, "safe")

On the Python level dtype is overloaded to mean class or instance.

A third can_cast signature, np.can_cast(DType, OtherDType, "safe"), may be used internally but need not be exposed to Python.
During DType creation, DTypes will be able to pass a list of CastingImpl objects, which can define casting to and from the DType. One of them should define the cast between instances of that DType. It can be omitted if the DType has only a single implementation and is nonparametric.
Each CastingImpl has a distinct DType signature:

CastingImpl[InputDtype, RequestedDtype]

and implements the following methods and attributes:
- To report safeness: resolve_descriptors(self, Tuple[DTypeMeta], Tuple[DType|None] : input) -> casting, Tuple[DType]. The casting output reports safeness (safe, unsafe, or same-kind), and the tuple is used for multistep casting, as in the example below.
- To get a casting function: get_loop(...) -> function_to_handle_cast (signature to be decided) returns a low-level implementation of a strided casting function ("transfer function") capable of performing the cast. Initially the implementation will be private, and users will only be able to provide strided loops with the signature.
- For performance: a casting attribute taking a value of equivalent, safe, unsafe, or same-kind.
Performing a cast

The above figure illustrates a multistep cast of an int24 with a value of 42 to a string of length 20 ("S20"). We've picked an example where the implementer has only provided limited functionality: a function to cast an int24 to an S8 string (which can hold all 24-bit integers). This means multiple conversions are needed.
The full process is:
1. Call CastingImpl[Int24, String].resolve_descriptors((Int24, String), (int24, "S20")). This provides the information that CastingImpl[Int24, String] only implements the cast of int24 to "S8".
2. Since "S8" does not match "S20", use CastingImpl[String, String].get_loop() to find the transfer (casting) function to convert an "S8" into an "S20".
3. Fetch the transfer function to convert an int24 to an "S8" using CastingImpl[Int24, String].get_loop().
4. Perform the actual cast using the two transfer functions: int24(42) -> S8("42") -> S20("42").

resolve_descriptors allows the implementation for np.array(42, dtype=int24).astype(String) to call CastingImpl[Int24, String].resolve_descriptors((Int24, String), (int24, None)). In this case the result of (int24, "S8") defines the correct cast: np.array(42, dtype=int24).astype(String) == np.array("42", dtype="S8").
Casting safety

To compute np.can_cast(int24, "S20", casting="safe"), only the resolve_descriptors function is required, and it is called in the same way as in the figure describing a cast. In this case, the calls to resolve_descriptors will also provide the information that int24 -> "S8" as well as "S8" -> "S20" are safe casts, and thus that int24 -> "S20" is also a safe cast.
In some cases, no cast is necessary. For example, on most Linux systems np.dtype("long") and np.dtype("longlong") are different dtypes but are both 64-bit integers. In this case, the cast can be performed using long_arr.view("longlong"). The information that a cast is a view will be handled by an additional flag. Thus the casting can have 8 values in total: the original 4 of equivalent, safe, unsafe, and same-kind, plus equivalent+view, safe+view, unsafe+view, and same-kind+view. NumPy currently defines dtype1 == dtype2 to be True only if byte order matches. This functionality can be replaced with the combination of "equivalent" casting and the "view" flag.
(For more information on the resolve_descriptors signature see the Public C-API section below and NEP 43.)
Casting between instances of the same DType

To keep down the number of casting steps, CastingImpl must be capable of any conversion between all instances of this DType. In general the DType implementer must include CastingImpl[DType, DType] unless there is only a singleton instance.
General multistep casting

We could implement certain casts, such as int8 to int24, even if the user provides only an int16 -> int24 cast. This proposal does not provide that, but future work might find such casts dynamically, or at least allow resolve_descriptors to return arbitrary dtypes. If CastingImpl[Int8, Int24].resolve_descriptors((Int8, Int24), (int8, int24)) returns (int16, int24), the actual casting process could be extended to include the int8 -> int16 cast. This adds a step.
Example: The implementation for casting integers to datetime would generally say that this cast is unsafe (because it is always an unsafe cast). Its resolve_descriptors function may look like:
def resolve_descriptors(self, DTypes, given_dtypes):
    from_dtype, to_dtype = given_dtypes
    from_dtype = from_dtype.ensure_canonical()  # ensure not byte-swapped
    if to_dtype is None:
        raise TypeError("Cannot convert to a NumPy datetime without a unit")
    to_dtype = to_dtype.ensure_canonical()  # ensure not byte-swapped

    # This is always an "unsafe" cast, but for int64, we can represent
    # it by a simple view (if the dtypes are both canonical).
    # (represented as C-side flags here).
    safety_and_view = NPY_UNSAFE_CASTING | _NPY_CAST_IS_VIEW
    return safety_and_view, (from_dtype, to_dtype)
Note

While NumPy currently defines integer-to-datetime casts, with the possible exception of the unit-less timedelta64, it may be better to not define these casts at all. In general we expect that user-defined DTypes will be using custom methods such as unit.drop_unit(arr) or arr * unit.seconds.
Alternatives:

Our design objectives are:
- Minimize the number of DType methods and avoid code duplication.
- Mirror the implementation of universal functions.

- The decision to use only the DType classes in the first step of finding the correct CastingImpl, in addition to defining CastingImpl.casting, makes it possible to retain the current default implementation of __common_dtype__ for existing user-defined dtypes, which could be expanded in the future.
- The split into multiple steps may seem to add complexity rather than reduce it, but it consolidates the signatures of np.can_cast(dtype, DTypeClass) and np.can_cast(dtype, other_dtype).
- Further, the API guarantees separation of concerns for user DTypes. The user Int24 dtype does not have to handle all string lengths if it does not wish to do so. Further, an encoding added to the String DType would not affect the overall cast. The resolve_descriptors function can keep returning the default encoding and the CastingImpl[String, String] can take care of any necessary encoding changes.
- The main alternative is moving most of the information that is here pushed into the CastingImpl directly into methods on the DTypes. But this obscures the similarity between casting and universal functions. It does reduce indirection, as noted below.
- An earlier proposal defined two methods __can_cast_to__(self, other) to dynamically return CastingImpl. This removes the requirement to define all possible casts at DType creation (of one of the involved DTypes). Such an API could be added later. It resembles Python's __getattr__ in providing additional control over attribute lookup.
Notes:

CastingImpl is used as a name in this NEP to clarify that it implements all functionality related to a cast. It is meant to be identical to the ArrayMethod proposed in NEP 43 as part of restructuring ufuncs to handle new DTypes. All type definitions are expected to be named ArrayMethod.

The way dispatching works for CastingImpl is planned to be limited initially and fully opaque. In the future, it may or may not be moved into a special UFunc, or behave more like a universal function.
Coercion to and from Python objects#
When storing a single value in an array or taking it out, it is necessary to coerce it – that is, convert it – to and from the low-level representation inside the array.
Coercion is slightly more complex than typical casts. One reason is that a Python object could itself be a 0-dimensional array or scalar with an associated DType.
Coercing to and from Python scalars requires two to three methods that largely correspond to the current definitions:
- __dtype_setitem__(self, item_pointer, value)
- __dtype_getitem__(self, item_pointer, base_obj) -> object; base_obj is for memory management and usually ignored; it points to an object owning the data. Its only role is to support structured datatypes with subarrays within NumPy, which currently return views into the array. The function returns an equivalent Python scalar (i.e. typically a NumPy scalar).
- __dtype_get_pyitem__(self, item_pointer, base_obj) -> object (initially hidden for new-style user-defined datatypes, may be exposed on user request). This corresponds to the arr.item() method also used by arr.tolist() and returns Python floats, for example, instead of NumPy floats.
(The above is meant for the C-API. A Python-side API would have to use byte buffers or similar to implement this, which may be useful for prototyping.)
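For such prototyping, the methods could be sketched in Python with a writable byte buffer standing in for item_pointer (everything here, including the int24-like packing, is a hypothetical illustration):

def __dtype_setitem__(self, item_pointer, value):
    # pack a Python int into 3 little-endian bytes (an int24-like DType)
    item_pointer[:3] = int(value).to_bytes(3, "little", signed=True)

def __dtype_getitem__(self, item_pointer, base_obj) -> object:
    # base_obj is ignored here; no views into the data are returned
    return int.from_bytes(bytes(item_pointer[:3]), "little", signed=True)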
When a certain scalar has a known (different) dtype, NumPy may in the future use casting instead of __dtype_setitem__.

A user datatype is (initially) expected to implement __dtype_setitem__ for its own DType.type and all basic Python scalars it wishes to support (e.g. int and float). In the future a function known_scalar_type may be made public to allow a user dtype to signal which Python scalars it can store directly.
Implementation: The pseudocode implementation for setting a single item in an array from an arbitrary Python object value is (some functions here are defined later):

def PyArray_Pack(dtype, item_pointer, value):
    DType = type(dtype)
    if DType.type is type(value) or DType.known_scalartype(type(value)):
        return dtype.__dtype_setitem__(item_pointer, value)

    # The dtype cannot handle the value, so try casting:
    arr = np.array(value)
    if arr.dtype is object or arr.ndim != 0:
        # not a numpy or user scalar; try using the dtype after all:
        return dtype.__dtype_setitem__(item_pointer, value)

    arr = arr.astype(dtype)  # cast the zero-dimensional array
    item_pointer.write(arr[()])
where the call to np.array() represents the dtype discovery and is not actually performed.
Example: The current datetime64 returns np.datetime64 scalars and can be assigned from np.datetime64. However, the datetime __dtype_setitem__ also allows assignment from date strings ("2016-05-01") or Python integers. Additionally the datetime __dtype_get_pyitem__ function actually returns a Python datetime.datetime object (most of the time).
Alternatives: This functionality could also be implemented as a cast to and from the object dtype. However, coercion is slightly more complex than typical casts. One reason is that in general a Python object could itself be a zero-dimensional array or scalar with an associated DType. Such an object has a DType, and the correct cast to another DType is already defined:

np.array(np.float32(4), dtype=object).astype(np.float64)

is identical to:

np.array(4, dtype=np.float32).astype(np.float64)

Implementing the first object to np.float64 cast explicitly would require the user to duplicate, or fall back to, existing casting functionality.
It is certainly possible to describe the coercion to and from Python objects using the general casting machinery, but the object dtype is special and important enough to be handled by NumPy using the presented methods.
Further issues and discussion:
- The __dtype_setitem__ function duplicates some code, such as coercion from a string. datetime64 allows assignment from string, but the same conversion also occurs for casting from the string dtype to datetime64. We may in the future expose the known_scalartype function to allow the user to implement such duplication. For example, NumPy would normally use np.array(np.string_("2019")).astype(datetime64), but datetime64 could choose to use its __dtype_setitem__ instead for performance reasons.
- There is an issue about how subclasses of scalars should be handled. We anticipate that the dtype for np.array(float64_subclass) will no longer be automatically detected as float64. The user can still provide dtype=np.float64. However, the above automatic casting using np.array(scalar_subclass).astype(requested_dtype) will fail. In many cases, this is not an issue, since the Python __float__ protocol can be used instead. But in some cases, this will mean that subclasses of Python scalars will behave differently.
Note

Example: np.complex256 should not use __float__ in its __dtype_setitem__ method in the future unless it is a known floating point type. If the scalar is a subclass of a different high precision floating point type (e.g. np.float128) then this currently loses precision without notifying the user. In that case np.array(float128_subclass(3), dtype=np.complex256) may fail unless the float128_subclass is first converted to the np.float128 base class.
DType discovery during array coercion#
An important step in the use of NumPy arrays is creation of the array from collections of generic Python objects.
Motivation: Although the distinction is not clear currently, there are two main needs:
np.array([1, 2, 3, 4.])
needs to guess the correct dtype based on the Python objects inside. Such an array may include a mix of datatypes, as long as they can be promoted. A second use case is when users provide the output DType class, but not the specific DType instance:
np.array([object(), None], dtype=np.dtype[np.string_])  # (or dtype="S")
In this case the user indicates that object() and None should be interpreted as strings.
The need to consider the user-provided DType also arises for a future Categorical:
np.array([1, 2, 1, 1, 2], dtype=Categorical)
which must interpret the numbers as unique categorical values rather than integers.
There are three further issues to consider:
1. It may be desirable to create datatypes associated with normal Python scalars (such as datetime.datetime) that do not have a dtype attribute already.
2. In general, a datatype could represent a sequence; however, NumPy currently assumes that sequences are always collections of elements (the sequence cannot be an element itself). An example would be a vector DType.
3. An array may itself contain arrays with a specific dtype (even general Python objects). For example:

np.array([np.array(None, dtype=object)], dtype=np.String)

poses the issue of how to handle the included array.
Some of these difficulties arise because finding the correct shape of the output array and finding the correct datatype are closely related.
Implementation: There are two distinct cases above:
The user has provided no dtype information.
The user provided a DType class – as represented, for example, by "S", representing a string of any length.
In the first case, it is necessary to establish a mapping from the Python type(s) of the constituent elements to the DType class. Once the DType class is known, the correct dtype instance needs to be found. In the case of strings, this requires finding the string length.
These two cases shall be implemented by leveraging two pieces of information:
- DType.type: The current type attribute to indicate which Python scalar type is associated with the DType class (this is a class attribute that always exists for any datatype and is not limited to array coercion).
- __discover_descr_from_pyobject__(cls, obj) -> dtype: A classmethod that returns the correct descriptor given the input object. Note that only parametric DTypes have to implement this. For nonparametric DTypes, using the default instance will always be acceptable.
The Python scalar type which is already associated with a DType through the DType.type attribute maps from the DType to the Python scalar type. At registration time, a DType may choose to allow its Python scalar type to be discovered automatically. This requires a lookup in the opposite direction, which will be implemented using a global mapping (dictionary-like) of:

known_python_types[type] = DType
Correct garbage collection requires additional care. If both the Python scalar type (pytype) and DType are created dynamically, they will potentially be deleted again. To allow this, it must be possible to make the above mapping weak. This requires that the pytype holds a reference to the DType explicitly. Thus, in addition to building the global mapping, NumPy will store the DType as pytype.__associated_array_dtype__ in the Python type. This does not define the mapping and should not be accessed directly. In particular, potential inheritance of the attribute does not mean that NumPy will use the superclass's DType automatically. A new DType must be created for the subclass.
Note
Python integers do not have a clear/concrete NumPy type associated right now. This is because during array coercion NumPy currently finds the first type capable of representing their value in the list of long, unsigned long, int64, unsigned int64, and object (on many machines long is 64 bit).
Instead they will need to be implemented using an AbstractPyInt
. This
DType class can then provide __discover_descr_from_pyobject__
and
return the actual dtype which is e.g. np.dtype("int64")
. For
dispatching/promotion in ufuncs, it will also be necessary to dynamically
create AbstractPyInt[value]
classes (creation can be cached), so that
they can provide the current value based promotion functionality provided
by np.result_type(python_integer, array)
[2] .
To allow a DType to accept inputs as scalars that are not basic Python types or instances of DType.type, we use the known_scalar_type method. This can allow discovery of a vector as a scalar (element) instead of a sequence (for the command np.array(vector, dtype=VectorDType)) even when vector is itself a sequence or even an array subclass. This will not be public API initially, but may be made public at a later time.
Example: The current datetime DType requires a __discover_descr_from_pyobject__ which returns the correct unit for string inputs. This allows it to support:

np.array(["2020-01-02", "2020-01-02 11:24"], dtype="M8")

by inspecting the date strings. Together with the common dtype operation, this allows it to automatically find that the datetime64 unit should be "minutes".
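Today's behavior already illustrates the intended result:

>>> np.array(["2020-01-02", "2020-01-02 11:24"], dtype="M8")
array(['2020-01-02T00:00', '2020-01-02T11:24'], dtype='datetime64[m]')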
NumPy internal implementation: The implementation to find the correct dtype will work similarly to the following pseudocode:

def find_dtype(array_like):
    common_dtype = None
    for element in array_like:
        # default to object dtype, if unknown
        DType = known_python_types.get(type(element), np.dtype[object])
        dtype = DType.__discover_descr_from_pyobject__(element)

        if common_dtype is None:
            common_dtype = dtype
        else:
            common_dtype = np.promote_types(common_dtype, dtype)
    return common_dtype
In practice, the input to np.array() is a mix of sequences and array-like objects, so deciding what is an element requires checking whether it is a sequence. The full algorithm (without user-provided dtypes) thus looks more like:

def find_dtype_recursive(array_like, dtype=None):
    """
    Recursively find the dtype for nested sequences (arrays are not
    supported here).
    """
    DType = known_python_types.get(type(array_like), None)

    if DType is None and is_array_like(array_like):
        # Code for a sequence; an array_like may have a DType we
        # can use directly:
        for element in array_like:
            dtype = find_dtype_recursive(element, dtype=dtype)
        return dtype
    elif DType is None:
        DType = np.dtype[object]

    # dtype discovery and promotion as in `find_dtype` above
If the user provides DType, then this DType will be tried first, and the dtype may need to be cast before the promotion is performed.
Limitations: The motivational point 3. of a nested array np.array([np.array(None, dtype=object)], dtype=np.String) is currently (sometimes) supported by inspecting all elements of the nested array. User DTypes will implicitly handle these correctly if the nested array is of object dtype. In some other cases NumPy will retain backward compatibility for existing functionality only. NumPy uses such functionality to allow code such as:

>>> np.array([np.array(["2020-05-05"], dtype="S")], dtype=np.datetime64)
array([['2020-05-05']], dtype='datetime64[D]')

which discovers the datetime unit D (days). This possibility will not be accessible to user DTypes without an intermediate cast to object or a custom function.

The use of a global type map means that an error or warning has to be given if two DTypes wish to map to the same Python type. In most cases user DTypes should only be implemented for types defined within the same library to avoid the potential for conflicts. It will be the DType implementor's responsibility to be careful about this and to avoid registration when in doubt.
Alternatives:
Instead of a global mapping, we could rely on the scalar attribute
scalar.__associated_array_dtype__
. This only creates a difference in behavior for subclasses, and the exact implementation can be undefined initially. Scalars will be expected to derive from a NumPy scalar. In principle NumPy could, for a time, still choose to rely on the attribute.An earlier proposal for the
dtype
discovery algorithm used a two-pass approach, first finding the correctDType
class and only then discovering the parametricdtype
instance. It was rejected as needlessly complex. But it would have enabled value-based promotion in universal functions, allowing:np.add(np.array([8], dtype="uint8"), [4])
to return a
uint8
result (instead ofint16
), which currently happens for:np.add(np.array([8], dtype="uint8"), 4)
(note the list
[4]
instead of scalar4
). This is not a feature NumPy currently has or desires to support.
Further issues and discussion: It is possible to create a DType such as Categorical, array, or vector which can only be used if dtype=DType is provided. Such DTypes cannot roundtrip correctly. For example:

np.array(np.array(1, dtype=Categorical)[()])

will result in an integer array. To get the original Categorical array, dtype=Categorical will need to be passed explicitly. This is a general limitation, but round-tripping is always possible if dtype=original_arr.dtype is passed.
Public C-API#
DType creation#
To create a new DType the user will need to define the methods and attributes outlined in the Usage and impact section and detailed throughout this proposal.
In addition, some methods similar to those in PyArray_ArrFuncs will be needed for the slots struct below.

As mentioned in NEP 41, the interface to define this DType class in C is modeled after PEP 384: Slots and some additional information will be passed in a slots struct and identified by ssize_t integers:
static struct PyArrayMethodDef slots[] = {
    {NPY_dt_method, method_implementation},
    ...,
    {0, NULL}
};

typedef struct{
    PyTypeObject *typeobj;    /* type of python scalar or NULL */
    int flags;                /* flags, including parametric and abstract */
    /* NULL terminated CastingImpl; is copied and references are stolen */
    CastingImpl *castingimpls[];
    PyType_Slot *slots;
    PyTypeObject *baseclass;  /* Baseclass or NULL */
} PyArrayDTypeMeta_Spec;

PyObject* PyArray_InitDTypeMetaFromSpec(PyArrayDTypeMeta_Spec *dtype_spec);

All of this is passed by copying.
TODO: The DType author should be able to define new methods for the DType, up to defining a full object, and, in the future, possibly even extending the PyArrayDTypeMeta_Type struct. We have to decide what to make available initially. A solution may be to allow inheriting only from an existing class: class MyDType(np.dtype, MyBaseclass). If np.dtype is first in the method resolution order, this also prevents an undesirable override of slots like ==.
The slots will be identified by names which are prefixed with NPY_dt_ and are:
- is_canonical(self) -> {0, 1}
- ensure_canonical(self) -> dtype
- default_descr(self) -> dtype (return must be native and should normally be a singleton)
- setitem(self, char *item_ptr, PyObject *value) -> {-1, 0}
- getitem(self, char *item_ptr, PyObject *base_obj) -> object or NULL
- discover_descr_from_pyobject(cls, PyObject) -> dtype or NULL
- common_dtype(cls, other) -> DType, NotImplemented, or NULL
- common_instance(self, other) -> dtype or NULL
Where possible, a default implementation will be provided if the slot is omitted or set to NULL. Nonparametric dtypes do not have to implement:
- discover_descr_from_pyobject (uses default_descr instead)
- common_instance (uses default_descr instead)
- ensure_canonical (uses default_descr instead)
Sorting is expected to be implemented using:

get_sort_function(self, NPY_SORTKIND sort_kind) -> {out_sortfunction, NotImplemented, NULL}

For convenience, it will be sufficient if the user implements only:

compare(self, char *item_ptr1, char *item_ptr2, int *res) -> {-1, 0, 1}
Limitations: The PyArrayDTypeMeta_Spec
struct is clumsy to extend (for
instance, by adding a version tag to the slots
to indicate a new, longer
version). We could use a function to provide the struct; it would require
memory management but would allow ABI-compatible extension (the struct is
freed again when the DType is created).
CastingImpl#
The external API for CastingImpl will be limited initially to defining:
- A casting attribute, which can be one of the supported casting kinds. This is the safest cast possible. For example, casting between two NumPy strings is of course "safe" in general, but may be "same kind" in a specific instance if the second string is shorter. If neither type is parametric the resolve_descriptors must use it.
- resolve_descriptors(PyArrayMethodObject *self, PyArray_DTypeMeta *DTypes[2], PyArray_Descr *dtypes_in[2], PyArray_Descr *dtypes_out[2], NPY_CASTING *casting_out) -> int {0, -1}. The out dtypes must be set correctly to dtypes which the strided loop (transfer function) can handle. Initially the result must have instances of the same DType class as the CastingImpl is defined for. The casting will be set to NPY_EQUIV_CASTING, NPY_SAFE_CASTING, NPY_UNSAFE_CASTING, or NPY_SAME_KIND_CASTING. A new, additional flag, _NPY_CAST_IS_VIEW, can be set to indicate that no cast is necessary and a view is sufficient to perform the cast. The cast should return -1 when an error occurred. If a cast is not possible (but no error occurred), a -1 result should be returned without an error set. (This point is under consideration; we may use -1 to indicate a general error and use a different return value for an impossible cast.) This means that it is not possible to inform the user about why a cast is impossible.
- strided_loop(char **args, npy_intp *dimensions, npy_intp *strides, ...) -> int {0, -1} (signature will be fully defined in NEP 43)
This is identical to the proposed API for ufuncs. The additional ... part of the signature will include information such as the two dtypes. More optimized loops are in use internally, and will be made available to users in the future (see notes).
Although verbose, the API will mimic the one for creating a new DType:
typedef struct{
    int flags;               /* e.g. whether the cast requires the API */
    int nin, nout;           /* number of inputs and outputs (always 1) */
    NPY_CASTING casting;     /* the "minimal casting level" */
    PyArray_DTypeMeta *dtypes;  /* input and output DType classes */
    /* NULL terminated slots defining the methods */
    PyType_Slot *slots;
} PyArrayMethod_Spec;
The focus differs between casting and general ufuncs. For example, for casts nin == nout == 1 is always correct, while for ufuncs casting is usually expected to be "no".
Notes: We may initially allow users to define only a single loop. Internally NumPy optimizes far more, and this should be made public incrementally in one of two ways:
- Allow multiple versions, such as:
  - contiguous inner loop
  - strided inner loop
  - scalar inner loop
- Or, more likely, expose the get_loop function which is passed additional information, such as the fixed strides (similar to our internal API).

The casting level denotes the minimal guaranteed casting level and can be -1 if the cast may be impossible. For most non-parametric casts, this value will be the casting level. NumPy may skip the resolve_descriptors call for np.can_cast() when the result is True based on this level.
The example does not yet include setup and error handling. Since these are similar to the UFunc machinery, they will be defined in NEP 43 and then incorporated identically into casting.
The slots/methods used will be prefixed with NPY_meth_
.
Alternatives:
- Aside from name changes and signature tweaks, there seem to be few alternatives to the above structure. The proposed API using a *_FromSpec function is a good way to achieve a stable and extensible API. The slots design is extensible and can be changed without breaking binary compatibility. Convenience functions can still be provided to allow creation with less code.
- One downside is that compilers cannot warn about function-pointer incompatibilities.
Implementation#
Steps for implementation are outlined in the Implementation section of NEP 41. In brief, we first will rewrite the internals of casting and array coercion. After that, the new public API will be added incrementally. We plan to expose it in a preliminary state initially to gain experience. All functionality currently implemented on the dtypes will be replaced systematically as new features are added.
Alternatives#
The space of possible implementations is large, so there have been many discussions, conceptions, and design documents. These are listed in NEP 40. Alternatives were also discussed in the relevant sections above.
References#
Copyright#
This document has been placed in the public domain.