One reason is that some ML libraries are really slow to import, so you don't want to put them at top-level unless you definitely need them. E.g. if I had just one function that needed to use a tokenizer from the Transformers library, I wouldn't want to eat a 2 second startup cost every time:
In [1]: %time import transformers
CPU times: user 3.21 s, sys: 7.8 s, total: 11 s
Wall time: 1.91 s
I didn't think about lazy loading, I also didn't know they were scoped differently! I thought it was some sort of organisation to keep imports close to usage. Thanks!
The scoping also has some performance advantage: locals are accessed by index in the bytecode, with all name resolution happening at compile-time, but globals require a string lookup in the module dictionary every time they're accessed.
This isn't something that should matter even a little in typical ML code. But in generic Python libraries, there are cases when this kind of micro-optimization can help. Similar tricks include turning methods into pre-bound attributes in __init__ to skip all the descriptor machinery on every call.
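A tiny sketch of both tricks (illustrative only; `Summer` and `norms` are made-up names):

```python
import math

class Summer:
    def __init__(self, values):
        self.values = values
        # Pre-bound attribute: looking up `total` on the instance now
        # finds a plain bound method in the instance dict, skipping
        # the class-level descriptor machinery on every call.
        self.total = self._total

    def _total(self):
        return sum(self.values)

def norms(pairs, sqrt=math.sqrt):
    # Default-argument trick: `sqrt` is a local, accessed by index in
    # the bytecode, instead of a global lookup plus a module-attribute
    # lookup on every iteration.
    return [sqrt(x * x + y * y) for x, y in pairs]
```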
Curious, in what cases might this help? The compute would have to be python-bound (not C library-bound); and the frequency of module lookups would have to be in the ballpark of other dictionary lookups. I wonder if cases like this exist in the real world.
The case where I've seen those tricks used with measurable effect was a Python debugger - specifically, its implementation of the sys.settrace callback. Since that gets executed on every line of code, every little bit helps.
(It's much faster if you implement the callback in native code, but then that doesn't work on IronPython, Jython etc.)
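A minimal sketch of the kind of per-line callback involved (not any actual debugger's code):

```python
import sys

def make_tracer(counts):
    # A settrace-style line counter. The callback fires on every
    # executed line of traced code, which is why even tiny savings
    # (like the pre-bound `_counts` local) add up.
    def tracer(frame, event, arg, _counts=counts):
        if event == "line":
            _counts[frame.f_lineno] = _counts.get(frame.f_lineno, 0) + 1
        return tracer  # keep tracing this frame
    return tracer

def traced_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

counts = {}
sys.settrace(make_tracer(counts))
result = traced_sum(3)
sys.settrace(None)
```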
Author here. It's a design choice, but there are two reasons I chose to use imports like this:
1) For demonstrative purposes. The title of the post is `A GPT in 60 Lines of NumPy`; I kinda wanted to show "hey, it's just numpy, nothing to be scared about!". Also, if an import is ONLY used in a single function, keeping it there visually signals "hey, this import is only used in this function", whereas when it's at the top of the file you're not really sure where and how many times an import is used.
2) Scoping. `load_encoder_hparams_and_params` imports tensorflow, which is really slow to import. When I was testing, I used randomly initialized weights instead of loading the checkpoint (which is slower), so I was only making use of the `gpt2` function. If I'd kept the import at the top level, it would've slowed things down unnecessarily.
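In miniature, the pattern looks like this (with stdlib `json` standing in for a genuinely slow import like tensorflow):

```python
def split_to_json(text):
    # Deferred import: the first call pays the one-off import cost;
    # every later call is a cheap sys.modules cache hit, and code
    # paths that never call this function never pay at all.
    import json  # stand-in for a slow third-party import
    return json.dumps(text.split())
```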
I do sometimes - just depends on the context and how often the function (or library) is going to get called.
Here - they put `import fire` only in the `if __name__ == "__main__":` - that seems reasonable to me as anyone pulling in the library from elsewhere doesn't need the pollution.
Right, I do this with argparse for creating simple CLIs for a module generally intended to be imported and used in another program. argparse has nothing to do with the actual module functions and won't be needed if the module is going to be used in a web app or some other context.
This makes even more sense for a non-standard-library dependency like fire, because you won't even need that dependency if you're going to import the module and write your own interface instead.
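A sketch of the shape (hypothetical module; argparse shown since it's stdlib, but the same layout works for fire):

```python
def double(x: int) -> int:
    """The actual library API; importable with zero CLI baggage."""
    return 2 * x

def main(argv=None):
    # The CLI-only dependency lives here, so `import` users of the
    # module never need it.
    import argparse
    parser = argparse.ArgumentParser(description="Double a number.")
    parser.add_argument("x", nargs="?", type=int, default=1)
    args = parser.parse_args(argv)
    result = double(args.x)
    print(result)
    return result

if __name__ == "__main__":
    main()
```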
The import in main doesn't seem particularly useful in context on a quick read, but considering the line
> utils.py contains the code to download and load the GPT-2 model weights, tokenizer, and hyper-parameters.
it seems possible some downloads are happening on import so does make sense to defer until actually needed, as suggested in sibling comments.
Does that import have side effects? Are we really worried about adding an entry to the imports dict if not? Or put differently, what cases do we actually get a negative effect from just importing at the top?
Oh yeah, imports in Python are not just, like, extending a namespace like in many other languages. At runtime they go and execute the module's top-level code (a package's __init__.py), and that can have arbitrary side effects - an entire program can run (although usually shouldn't) just in the import. Imports of large modules often take entire seconds.
It is absolutely worthwhile to avoid unnecessary imports if possible.
I know they _can_ have side-effects, I’ve just never seen a case where it actually mattered, and I have used Python professionally for 10 years. So I’m curious if this is more common in ML libraries or something.
I guess it depends on your definition of "side-effects" but it definitely comes up in common ML packages. For one example, importing `torch` often takes seconds, or tens of seconds, because the import itself needs to determine if it should come up in CUDA mode or not, how many OpenCL/GPU devices you have, how they're configured, etc.
It wouldn't surprise me if the original reason is the pervasive use of jupyter notebooks in ML, which don't adhere to normal python conventions, and are affected by slow imports only when those sections are explicitly evaluated.
Side-effects in imports are, in my opinion, unnecessary: they cost you some of the benefits of static analysis, make it harder to run with different parameters during tests or to compile to native code (where such tools exist), slow things down, and more.
Libraries could have an initializer function and the problem would go away.
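Something like this sketch (the class name and device probe are hypothetical):

```python
import platform

class _Backend:
    # Expensive device probing is an explicit init() call instead of
    # an import-time side effect, so importing stays instant and tests
    # can inject a fake device.
    def __init__(self):
        self.device = None

    def init(self, device=None):
        # A real library would probe CUDA/OpenCL devices here.
        self.device = device or "cpu-" + platform.machine()
        return self

backend = _Backend()  # importing the module does no work yet
```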
Importing another module takes non-zero time and uses non-zero memory, and let's face it: python is not exactly a fast language. Personally I'd appreciate a library author that takes steps to avoid a module load when that module is only used (for example) in some uncommonly-taken code paths.
In some (many?) cases it's probably premature optimization, but it doesn't hurt, so I don't see why anyone would get up in arms over it.
Importing is a _runtime_ operation: unless the module was previously imported, the interpreter will go and import it, executing that module's code. That can take a while. It will also bind a name in the current scope to the module's name, so... that might be considered pollution?
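You can see both halves of that - execution at import time, plus the sys.modules cache - with a module built by hand (a contrived sketch, not how you'd normally write it):

```python
import sys
import types

# Build a module at runtime to show that importing *executes* the
# module body, exactly as a normal import statement does.
mod = types.ModuleType("demo_mod")
exec("value = 40 + 2", mod.__dict__)   # the module's code runs here
sys.modules["demo_mod"] = mod          # the import system's cache

import demo_mod  # cache hit: no re-execution, just a name binding
```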
I'm in ML and I would also like an answer to this question.
I've seen a lot of Python people sprinkle imports all over the place in their code. I suspect this is a bad habit learned from too much time working in notebooks where you often have an "oh right, I need XXX library now" and just import it as you need it.
The aggressive aliasing I do get, since in DS/ML work it's very common to have the same function do slightly different things depending on the library (standard deviation between numpy and pandas is a good example).
But I personally like all of my imports at the top so I know what the code I'm about to read is going to be doing. I do seem to be in the minority on this (and would be glad to be corrected if I'm making some major error).
I often end up having to inline imports, because Python doesn't handle circular imports well.
Of course, "don't do circular imports". But if my Orders model has OrderLines, and my OrderLines point back to their Order, it's damn hard to avoid without putting everything in one huge file...
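The usual escape hatches, sketched with a hypothetical `order_lines` sibling module:

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Seen only by type checkers, never executed at runtime, so it
    # can't create a runtime import cycle.
    from order_lines import OrderLine

class Order:
    def __init__(self) -> None:
        self.lines: "List[OrderLine]" = []

    def add_line(self, line: "OrderLine") -> None:
        # If OrderLine were needed at runtime (say, to construct one),
        # a `from order_lines import OrderLine` right here would also
        # dodge the cycle: both modules are fully loaded by call time.
        self.lines.append(line)
```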
Ha, if only! I've been the one to introduce this at the last three jobs I've had, two of which had hundreds of engineers and plenty of python code before I got there.
"Best practices" are incredibly unevenly distributed, and I suspect this is only more true for data/ML-heavy python code.
New (v5) isort doesn't move imports to the top of the file anymore, at least not by default. There is a flag to retain the old behavior, but even then I don't think it will move imports from, say, inside a function body to the top of the module.
Scope-dependent imports. What if a package is just required for that particular function, and once that function is done, the imported package is no longer required?
Another reason (besides the ones already mentioned in the other comments) is that some imports might only be available on certain operating systems or architectures. I once wrote heavily optimized ML code for Nvidia Jetson Nano devices but I still wanted to be able to test the overall application (the non-Nvidia-specific code) on my laptop or in pipelines.
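A sketch of that setup (`jetson_stats` here is just a stand-in for whatever package only exists on the device image):

```python
import importlib.util

# Probe for the device-only dependency instead of importing it at top
# level, so the same file runs on a laptop or in CI.
HAVE_JETSON = importlib.util.find_spec("jetson_stats") is not None

def accelerate(xs):
    if HAVE_JETSON:
        import jetson_stats  # only ever runs on the device itself
        # ...device-specific fast path would go here...
    # Portable fallback, exercised on laptops and in pipelines.
    return [2 * x for x in xs]
```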
Why not: it's possible to get an ImportError if you make a mistake in the import statement. That kind of error should happen as early as possible, and you won't expect it to happen during a random function call.
If you're writing in a dynamically-typed, interpreted language like python, I think mistyping an import inside a function is really the least of your concerns when it comes to mistyping things.
Lazy loading, avoiding pollution of symbols in the root scope, avoiding re-exports of symbols in the root scope, self-documenting code ("this function uses these libraries"), portable coding (sometimes desirable), etc.
1) Circular dependencies (and you don't want your house of cards falling down if your IDE/isort decides to reorder a few things); 2) (slow/expensive) expressions that are evaluated on import; 3) startup time required for the module loader to resolve everything at start.