Running Batch Jobs #38
I also see the segfault on SLURM. Possibly related to this.
TL;DR: in my case this is no longer a problem with the latest libdrmaa for SLURM.

I tried to debug the code to figure out which call triggered the segfault. The problem seems to be in the call to drmaa_get_next_job_id. I then compiled the latest libdrmaa for SLURM, and with it I could no longer reproduce the segfault. This appears to be an upstream problem that only affects some releases when a certain combination of conditions is met. The code used to debug this issue:
On a malfunctioning system the message "Submitting..." is shown, immediately followed by "Segmentation fault". When the system is working properly you should instead see "Job submitted with...".
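The debug script itself was elided above; a minimal sketch along those lines might look like the following. This assumes drmaa-python with SLURM's libdrmaa available; the remote command and task indices are illustrative, and the function degrades to a no-op when the library is absent.

```python
# Sketch of a minimal reproduction, assuming drmaa-python and a SLURM
# libdrmaa on DRMAA_LIBRARY_PATH. Command and indices are illustrative.
try:
    import drmaa
except ImportError:
    drmaa = None  # drmaa-python not installed; sketch only

def submit_test_jobs():
    """Submit a tiny bulk job and report its ids, or None without drmaa."""
    if drmaa is None:
        return None
    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = "/bin/sleep"
        jt.args = ["10"]
        print("Submitting...")
        # On affected libdrmaa releases, collecting the returned ids (which
        # calls drmaa_get_next_job_id internally) segfaults before returning.
        job_ids = session.runBulkJobs(jt, 1, 2, 1)
        print("Job submitted with ids %s" % job_ids)
        session.deleteJobTemplate(jt)
        return job_ids

if __name__ == "__main__":
    submit_test_jobs()
```

On an affected system the script dies right after "Submitting..."; on a healthy one it prints the submitted ids.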
For anyone who lands on this post after dealing with segmentation fault errors on SLURM: you might want to ask your cluster administrator to install a newer, patched slurm-drmaa build. It's far from perfect and will still segfault if certain options are present.
Perhaps worth adding a note in the docs and closing, as this is a DRMAA implementation issue.
Just adding that natefoo/slurm-drmaa@7b5991e solves this issue. |
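If the system-wide libdrmaa cannot be replaced, drmaa-python also honors the DRMAA_LIBRARY_PATH environment variable, so a locally built, patched library can be used instead. The path below is purely illustrative; adjust it to wherever your build installs the shared object.

```shell
# Point drmaa-python at a locally built, patched libdrmaa before
# starting Python. The path is illustrative.
export DRMAA_LIBRARY_PATH="$HOME/slurm-drmaa/lib/libdrmaa.so.1"
echo "Using libdrmaa at: $DRMAA_LIBRARY_PATH"
```

With this set, `import drmaa` loads the patched library instead of the system one.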
Hi, I'm trying to run a batch job using the --array SLURM option, and I'm wondering whether this is possible with drmaa-python. I know there is runBulkJobs(...), but it doesn't seem to run an array of jobs: there is no $SLURM_ARRAY_TASK_ID (or the like) in the run environment.
When I try to run this I get a segmentation fault.
OUTPUT
A gdb backtrace gives the following result:
Aside: I'm also having trouble with it throwing an OutOfMemoryException. I'm therefore forced to assume the job was aborted due to memory (not preferable), so advice on what's happening there would be great.
Thanks!
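For reference, bulk submission through drmaa-python takes begin/end/step indices rather than a native --array flag. A sketch follows; it assumes drmaa-python is installed, the command is illustrative, and whether the backend turns this into a real SLURM array job (and thus exports SLURM_ARRAY_TASK_ID) depends on the libdrmaa implementation in use.

```python
# Sketch of bulk submission with drmaa-python. The DRMAA-portable task
# index placeholder is drmaa.JobTemplate.PARAMETRIC_INDEX; whether
# SLURM_ARRAY_TASK_ID is also set depends on the libdrmaa backend.
try:
    import drmaa
except ImportError:
    drmaa = None  # drmaa-python not installed; sketch only

def submit_bulk(begin=1, end=10, step=1):
    """Submit one job per index in [begin, end] and return their ids."""
    if drmaa is None:
        return []
    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = "/bin/echo"
        # PARAMETRIC_INDEX expands to the task index for each bulk job.
        jt.args = ["task", drmaa.JobTemplate.PARAMETRIC_INDEX]
        job_ids = session.runBulkJobs(jt, begin, end, step)
        session.deleteJobTemplate(jt)
        return job_ids
```

Each submitted job sees its own index via the placeholder, which is the portable DRMAA analogue of an array task id.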