# suitescript
j
Hey smart people, I have a question about a Map Reduce. I have a list of about 70,000 things that I need to compare nightly with another list of about 1,000 things, using some fuzzy string matching logic. The second list (of 1,000 things) is a record type that has a couple of different fields to compare against the bigger list. I’m thinking of using an MR to go through each entry in the big list, but would rather only retrieve the smaller list once (rather than retrieve it from the database each time for each of the 70,000 things). Is there a way to do this, get that smaller list once and just pass it around the MR? I suppose I could dump it into every entry of the key-value pairs but that seems... extreme.
a
you could maybe use the cache module. I've never used it for anything like a list of 1000 things so I'm not sure what limitations it might have, but that's the first thing I'd look into.
👍 1
most of my other suggestions would be around making the list sizes smaller 🙂 which isn't what you asked for, and I don't know enough about your data to know how viable that would be, presumably you've thought of this already.
e
Can't we use a global variable in the map reduce script?
a
nope, if you have a global it will only be available in the getInputData (GIS) and summarize stages, not the map or reduce stages
could you use a script param? get the list of ~1000 data in a scheduled script or another MR, and then call your main MR with the stringified data as a script param?
oh and your initial "extreme" solution, I think is totally fine. I've done similar before, but again never that size I don't think. so not sure at what point you'll run into memory issues
also... why not just do 70k db reads? reads are cheap compared to writes, i know it seems inefficient but if this is an overnight batch process, not sure that would be an issue... actually if the 1000 list can be gotten from search, you could create the search in the UI, I think NS automatically optimizes those searches based on usage so yours would likely qualify 😄
j
I’m doing all the searches w/SuiteQL
the 70k isn’t actually 70k records, it’s a UNION of three different SELECT DISTINCTs that has about 70k results
I’m thinking `N/cache` might be the way to go, though this would be my first try with that module. If I can cache my 1000 list at the start that should work, I think.
👍 1
a
it's been a while, but if i remember rightly there's a limit on the # of keys you can have in a cache, but you can do something really dumb like nest it, and then you can have as many as you want... something goofy like that i remember, NS may have "fixed" it
t
make sure to also use the loader function when getting from the cache so that it repopulates the cache if it ever returns null. Also take note of the max size limit (500KB) of the value you are putting in the cache.
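A minimal sketch of the loader pattern described above, not a tested NetSuite script. The `CacheLike` interface only mirrors the `key`, `loader`, and `ttl` options of `Cache.get` in `N/cache`; `SmallListRow`, `getSmallList`, the `'smallList'` key, and `MemoryCache` are hypothetical names, and `MemoryCache` is an in-memory stand-in so the snippet runs outside NetSuite. In a real script the cache object would presumably come from `cache.getCache({ name, scope })`.

```typescript
// Shape modeled on the options of Cache.get in N/cache (key, loader, ttl).
interface CacheLike {
  get(options: { key: string; loader?: () => string; ttl?: number }): string | null;
}

// Hypothetical row shape for the ~1,000-record list.
interface SmallListRow {
  id: string;
  name: string;
}

// Value size limit mentioned in the thread (500KB).
const MAX_VALUE_SIZE = 500 * 1024;

// Returns the small list; `fetchRows` only runs when the cache has no value.
function getSmallList(c: CacheLike, fetchRows: () => SmallListRow[]): SmallListRow[] {
  const raw = c.get({
    key: 'smallList', // hypothetical key name
    ttl: 3600,        // seconds
    loader: () => {
      const json = JSON.stringify(fetchRows());
      // Rough check only: length counts UTF-16 units, not bytes.
      if (json.length > MAX_VALUE_SIZE) {
        throw new Error('Small list is too large to cache as one value');
      }
      return json;
    },
  });
  return raw ? (JSON.parse(raw) as SmallListRow[]) : [];
}

// In-memory stand-in for a cache, for running the sketch outside NetSuite.
class MemoryCache implements CacheLike {
  private store = new Map<string, string>();

  get(options: { key: string; loader?: () => string; ttl?: number }): string | null {
    const hit = this.store.get(options.key);
    if (hit !== undefined) {
      return hit;
    }
    if (!options.loader) {
      return null;
    }
    const value = options.loader();
    this.store.set(options.key, value);
    return value;
  }
}
```

Calling `getSmallList` from each map invocation would then hit the database only on a cache miss, at the cost of one `JSON.parse` per call.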
j
Thanks guys
w
My experience is that a single map instance shares the global variables across iterations until it yields. Load the list at the beginning of the Map stage and it can probably be reused many times until the instance yields after running out of governance points or time. Regarding fuzzy search, we've been using string similarity with good results.
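For the fuzzy comparison itself, one common string-similarity measure is the Dice coefficient over character bigrams. Whether that is what the poster above uses is an assumption; the sketch below just illustrates the idea, and `bigrams`, `diceSimilarity`, and `bestMatch` are hypothetical helper names.

```typescript
// Counts the character bigrams of a string after lowercasing and removing whitespace.
function bigrams(text: string): Map<string, number> {
  const s = text.toLowerCase().replace(/\s+/g, '');
  const counts = new Map<string, number>();
  for (let i = 0; i < s.length - 1; i++) {
    const gram = s.substring(i, i + 2);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * shared bigrams / total bigrams, in the range 0..1.
// Strings with fewer than two characters have no bigrams and score 0.
function diceSimilarity(a: string, b: string): number {
  const left = bigrams(a);
  const right = bigrams(b);
  let total = 0;
  let shared = 0;
  left.forEach((count, gram) => {
    total += count;
    shared += Math.min(count, right.get(gram) || 0);
  });
  right.forEach((count) => {
    total += count;
  });
  return total === 0 ? 0 : (2 * shared) / total;
}

// Returns the highest-scoring candidate, or null when there are no candidates.
function bestMatch(
  needle: string,
  candidates: string[]
): { candidate: string; score: number } | null {
  let best: { candidate: string; score: number } | null = null;
  for (const candidate of candidates) {
    const score = diceSimilarity(needle, candidate);
    if (best === null || score > best.score) {
      best = { candidate, score };
    }
  }
  return best;
}
```

With 70,000 entries against 1,000 candidates this is about 70 million comparisons per night, so precomputing the bigram maps for the smaller list once would likely be worth it.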
a
...if you're loading the list at the beginning of the map stage, and there's 70,000 map stages, you're going to load the list 70,000 times?
w
no
One map instance will run multiple key-value pairs.
a
how do i write code to execute in the map instance, but not for each key-value pair?
w
We have this function in a map/reduce. `fetchEmployees()` is called at the beginning of the map function and then we access the `EMPLOYEE_LIST` throughout the map instance.
const EMPLOYEE_LIST = {
    fetched: false,
    employees : [],
}
const fetchEmployees = () => {
    if(!EMPLOYEE_LIST.fetched) {
        search.create({
            type: search.Type.EMPLOYEE,
            filters: [ ['isinactive', search.Operator.IS, 'F'] ],
            columns: [ EMPLOYEE.FIELDS.ENTITYID ],
        }).run().each(result => {
            const entityId = result.getValue(EMPLOYEE.FIELDS.ENTITYID) as string
            EMPLOYEE_LIST.employees.push({
                id: result.id,
                [EMPLOYEE.FIELDS.ENTITYID]: entityId.toLowerCase(),
            })
            return true
        })
        EMPLOYEE_LIST.fetched = true
    }
}
a
so your `EMPLOYEE_LIST` declaration is outside the map? and when a new instance is created it will reset the global to empty?
w
Yes
Technically, I'm thinking that when a new instance is created, the global doesn't exist in that context, so it's not reset per-se.
a
right, what I mean is, the declaration will be run again in the instance...
w
Yes, once per instance.
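The "once per instance" behaviour described here can be wrapped in a small generic helper. This is a sketch that runs outside NetSuite: `oncePerInstance`, `getReferenceNames`, `mapOne`, and `loadCount` are hypothetical names, the hard-coded array stands in for the real search or SuiteQL query, and the assumption that module-scope state survives across key-value pairs within one map instance comes from the experience reported in this thread, not from a documented guarantee.

```typescript
// Wraps a loader so it runs at most once for the lifetime of this module scope.
function oncePerInstance<T>(load: () => T): () => T {
  let loaded = false;
  let value: T | undefined;
  return () => {
    if (!loaded) {
      value = load();
      loaded = true;
    }
    return value as T;
  };
}

// Counter used only to demonstrate how often the loader actually runs.
let loadCount = 0;

// Stand-in for the expensive lookup of the smaller list.
const getReferenceNames = oncePerInstance(() => {
  loadCount++;
  return ['acme corp', 'globex'];
});

// Stand-in for the body of map(); a real entry point would receive a map context.
function mapOne(value: string): boolean {
  return getReferenceNames().indexOf(value.toLowerCase()) !== -1;
}
```

Every call to `mapOne` in the same module scope reuses the first result; a fresh instance starts with `loaded` set to `false` and loads the list again.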
a
good to know, presumably that works in the reduce stage too
w
I'd assume so