FUTURE plans for GraphBLAS:

    JIT package: don't just check 1st line of GraphBLAS.h when deciding to
        unpack the src in user cache folder. Use a crc test.

    cumulative sum (or other monoid)

    Raye: link-time optimization with binary for operators, for Julia

    pack/unpack COO

    kernel fusion

    CUDA kernels
        CUDA: finding src
        CUDA: kernel source location, and name

    distributed framework

    fine-grain parallelism for dot-product based mxm, mxv, vxm,
        then add GxB_vxvt (outer product) and GxB_vtxv (inner product)
        (or call them GxB_outerProduct and GxB_innerProduct?)

    aggregators

    GrB_extract with GrB_Vectors instead of (GrB_Index *) arrays for I and J

    iso: set a flag with GrB_get/set to disable iso.  useful if the matrix is
    about to become non-iso anyway. Pagerank does:

        r = teleport (becomes iso)
        r += A*x     (becomes non-iso)

    apply: C = f(A), A dense, no mask or accum, C already dense: do in place

    JIT: allow a flag to be set in a type or operator to selectively control
        the JIT

    JIT: requires GxB_BinaryOp_new to give the string that defines the op.
    Allow use of
        GrB_BinaryOp_new (...)
        GrB_set (op, GxB_DEFN, "string")
    also for all ops

    candidates for kernel fusion:
        * triangle counting: mxm then reduce to scalar
        * lcc: mxm then reduce to vector
        * FusedMM: see https://arxiv.org/pdf/2011.06391.pdf

    more:
        * consider algorithms where fusion can occur
        * performance monitor, or revised burble, to detect generic cases
        * check if vectorization of GrB_mxm is effective when using clang
        * see how HNSW vector search could be implemented in GraphBLAS

integer sizes:
        use 32-bit or 64-bit integers in A->i, A->h, and perhaps A->p.
        >>> requires generic pack/unpack method for O(1) move constructors.
        could use new build methods (int32 indices), or GrB_Vectors for I and J.
        does not require new extracTuples method (simply typecast the copy).
        could use new GrB_*import/export methods, with int32 indices, or just
            typecast on copy.

first step:
    A->i and A->h have the same size (both 32, or both 64),
    determined by max(nrows,ncols) <= 2^30.  A->p always 64-bit.
    In the GrB_Matrix, A->i and A->h become void *, not int64_t *.
    This gives 2 types of matrices.

        Ai = A->i ;
        Ah = A->h ;
        Ai [p] = i
        Ah [k] = j
        i = Ai [p] ;
        j = Ah [k] ;

    replaced with

        GBI_GET_PTR (A) ;               // new macro
        GBH_GET_PTR (A) ;               // new macro
        GBI_SET (Ai, p, k)              // new macro
        GBH_SET (Ah, k, j)              // new macro
        i = GHI_A (Ai, pA, avlen) ;     // no change to use of macro
        j = GHH_A (Ah, pA, avlen) ;     // no change to use of macro

    where

        #define GPI_GET_PTR(A) \
            A ## i32 = (A->is_32) ? (int32_t *) A->i : NULL ;
            A ## i64 = (A->is_32) ? NULL : (int64_t *) A->i ;

        #define GPH_GET_PTR(A) \
            A ## h32 = (A->is_32) ? (int32_t *) A->h : NULL ;
            A ## h64 = (A->is_32) ? NULL : (int64_t *) A->h ;

        #define GBI_SET (Ai, p, i) \
            if (Ai ## 32 == NULL)
            {   
                Ai ## 64 [p] = (i) ;
            }
            else 
            {
                Ai ## 32 [p] = (i) ;
            }

        #define GBH_SET (Ah, k, j) \
            if (Ah ## 32 == NULL)
            {   
                Ah ## 64 [k] = (j) ;
            }
            else 
            {
                Ah ## 32 [k] = (j) ;
            }

// handles the case where Ai is 32 bit, 64 bit, or NULL for full matrix:
#define GBI (Ai, p, vlen)
    (Ai ## 32 == NULL) ? ((Ai ## 64 == NULL) ? ((p) % vlen) : Ai ## 64 [p])
        : (Ai ## 32 [p]) ;

// use the same for GBH

All JIT kernels can be specialized to the int types of its matrices.
FactoryKernels and generic kernels cannot; they would need to use the above
macros.  Or, current single-integer codes with no specializations would use:

    if (all matrices are 32 bit)
        use all 32 bit Ai and Ah for all matrices
    else if all 64
        use all 64 bit Ai and Ah for all matrices
    else
        use generic macros as above

The code inside each of the 3 cases above would be templatized methods.

memcpy (Ai, Ci, ...) would be replaced with a function that did all 4
variants:  both 32, 32 to 64, 64 to 32, and both 64.

