Thursday, January 15, 2015

Importing Data into MongoDB with Python

Environment


  1. Python 2.7.6
    Installing the Python Driver for MongoDB
  2. MongoDB 2.6.7
    Installing MongoDB on Ubuntu 14.04
  3. Ubuntu 14.04



JSON Support for Python


Official Documentation: Simplejson is a simple, fast, complete, correct and extensible JSON encoder and decoder for Python 2.5+ and Python 3.3+. It is pure Python code with no dependencies, but includes an optional C extension for a serious speed boost.

Install simplejson using pip:
sudo pip install simplejson
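
The import script further down uses the standard-library json module rather than simplejson. If you want simplejson's optional C speedups where it is installed, one common pattern is to fall back from one module to the other. This is a minimal sketch, not something the original script does:

```python
# Prefer simplejson when it is installed; otherwise fall back to the
# standard-library json module. Both expose the same loads()/dumps() API.
try:
    import simplejson as json
except ImportError:
    import json

# Parse a small JSON document and print it back out.
record = json.loads('{"name": "item1", "count": 3}')
print(json.dumps(record, sort_keys=True))
```

Because both modules are bound to the name json, the rest of a script can call json.loads without caring which one was imported.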



Writing to MongoDB


 # -*- coding: utf-8 -*-
   
 import argparse  
 import datetime  
 import pprint  
 import pymongo  
 import json  
 import os  
 import sys  
 import fnmatch  
   
 ##      ARGPARSE USAGE  
 ##     <https://docs.python.org/2/howto/argparse.html>  
 parser = argparse.ArgumentParser(description="Import records into MongoDB")  
 group = parser.add_mutually_exclusive_group()  
 group.add_argument("-v", "--verbose", action="store_true")  
 group.add_argument("-q", "--quiet", action="store_true")  
 parser.add_argument("max", type=int, help="the maximum records to import", default=sys.maxint)  
 parser.add_argument("path", help="The input path for importing. This can be either a file or directory.")  
 parser.add_argument("db", help="The MongoDB name to import into.")  
 parser.add_argument("collection", help="The MongoDB collection to import into.")  
 args = parser.parse_args()
   
 ##      RETRIEVE files from filesystem
 def getfiles(path) :  
      if len(path) <= 1 :   
           print "!Please Supply an Input File"  
           return []  
      try :  
           input_path = str(path).strip()  
   
           if not os.path.exists(input_path) :
                print "!Input Path does not exist (input_path = ", input_path, ")"  
                return []  
   
           if not os.path.isdir(input_path) :
                if args.verbose :  
                     print "*Input Path is Valid (input_path = ", input_path, ")"  
                return [input_path]       
   
           matches = []  
           for root, dirnames, filenames in os.walk(input_path):  
                for filename in fnmatch.filter(filenames, '*.json'):  
                     matches.append(os.path.join(root, filename))  
             
           if len(matches) > 0 :  
                if args.verbose :  
                     print "*Found Files in Path (input_path = ", input_path, ", total-files = ", len(matches), ")"  
                return matches  
   
           print "!No Files Found in Path (input_path = ", input_path, ")"  
      except ValueError :  
           print "!Invalid Input (input_path, ", input_path, ")"  
      return []  
   
 ##     IMPORT records into mongo
 def read(jsonFiles) :  
      from pymongo import MongoClient  
   
      client = MongoClient('mongodb://localhost:27017/')  
      db = client[args.db]  
   
      counter = 0  
      for jsonFile in jsonFiles :  
           with open(jsonFile, 'r') as f:  
                for line in f:  
   
                     # skip blank lines; invalid JSON is caught below
                     line = line.strip()
                     if not line : continue
                     try:  
                          db[args.collection].insert(json.loads(line))  
                          counter += 1  
                     except pymongo.errors.DuplicateKeyError as dke:  
                          if args.verbose :  
                               print "Duplicate Key Error: ", dke  
                     except ValueError as e:  
                          if args.verbose :  
                               print "Value Error: ", e  
   
                     # friendly log message                      
                     if 0 == counter % 100 and 0 != counter and args.verbose : print "loaded line: ", counter  
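                     if counter >= args.max :
                          break

           # stop reading further files once the limit is reached
           if counter >= args.max :
                break

      # the "with" block already closed each file; release the connection
      client.close()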
   
      if 0 == counter :  
           print "Warning: No Records were Loaded"  
      else :  
           print "loaded a total of ", counter, " lines"  
   
   
 ##      EXECUTE
 files = getfiles(args.path)  
 read(files)  

This will write to MongoDB.
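
The parsing half of the loop can be tried without a MongoDB server. The sketch below is an illustration rather than part of the script: parse_lines and its limit parameter are hypothetical names. It yields one dict per valid line and skips blank or malformed lines, mirroring the ValueError handling above.

```python
import json


def parse_lines(lines, limit=None):
    """Yield one dict per line of JSON text, skipping blank or invalid lines.

    parse_lines and limit are illustrative names, not part of the script.
    """
    loaded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            document = json.loads(line)
        except ValueError:
            continue  # not valid JSON; the script logs this when verbose
        if not isinstance(document, dict):
            continue  # MongoDB documents must be objects, not bare values
        yield document
        loaded += 1
        if limit is not None and loaded >= limit:
            return


# Two valid records, one blank line and one malformed line.
sample = ['{"name": "item1"}\n', '\n', 'not json\n', '{"name": "item2"}\n']
documents = list(parse_lines(sample))
print(documents)
```

Each yielded dict could then be passed to the collection. On PyMongo 3.0 and later, insert_one replaces the insert call used above; insert was deprecated in 3.0 and removed in 4.0.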

Command line usage is:
 python import.py 1000 /media/data/records.json mydb mycollection -v

The -v flag is optional and will log in a verbose manner to the console.
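
To see how that command line maps onto the parsed arguments, the parser can be rebuilt and fed the same values from a list. This is a sketch under a few assumptions: build_parser is a hypothetical helper name, and the help strings are shortened.

```python
import argparse


def build_parser():
    """Rebuild the script's argument parser (build_parser is an illustrative name)."""
    parser = argparse.ArgumentParser(description="Import records into MongoDB")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-v", "--verbose", action="store_true")
    group.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("max", type=int, help="the maximum records to import")
    parser.add_argument("path", help="input file or directory")
    parser.add_argument("db", help="MongoDB database name")
    parser.add_argument("collection", help="MongoDB collection name")
    return parser


# Same arguments as the command line shown above, minus "python import.py".
argv = ["1000", "/media/data/records.json", "mydb", "mycollection", "-v"]
args = build_parser().parse_args(argv)
print(args.max, args.path, args.db, args.collection, args.verbose)
```

Because max is a required positional argument, the default=sys.maxint in the script never takes effect, so the sketch leaves it out. The first argument must always be supplied.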


Other Considerations


I've noticed that Twitter data from the Gnip firehose can be imported directly into MongoDB.

On the other hand, Java objects serialized into JSON using the GSON package need to be restructured. For example, an array of objects serialized using GSON will look like this:
 [  
      { name : "item1" },  
      { name : "item2" },  
      { name : "item-n" }  
 ]  

If you use a web validator / formatter, such as JsonEditorOnline, this output will be parsed correctly.

However, MongoDB doesn't like this syntax and prefers this approach, with one document per line:
 { name : "item1" }  
 { name : "item2" }  
 { name : "item-n" }  

Note the absence of both the commas separating the items and the square brackets at the beginning and end of the structure.
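
The restructuring can be done in a few lines of Python. The sketch below is one possible approach; array_to_lines is a hypothetical helper name. It assumes the array is strict JSON with quoted keys, unlike the shorthand shown above, because json.loads rejects unquoted keys.

```python
import json


def array_to_lines(array_text):
    """Convert a JSON array of objects into one JSON document per line."""
    items = json.loads(array_text)
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON array")
    return "\n".join(json.dumps(item, sort_keys=True) for item in items) + "\n"


# Keys are quoted here because json.loads only accepts strict JSON.
gson_output = '[{"name": "item1"}, {"name": "item2"}, {"name": "item-n"}]'
converted = array_to_lines(gson_output)
print(converted)
```

The resulting text can be written to a .json file and passed to the import script as its path argument.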


MacOS

The instructions don't vary greatly.

I prefer to use a virtualenv in my local dev environment. Virtualenv is described in this blog post.

Set up the virtualenv on the terminal:
virtualenv --system-site-packages .
source bin/activate

Once inside the virtualenv, install pymongo:
(data-imdb-populate-mongo)~/workspaces/data-imdb-populate-mongo$ pip install pymongo
Collecting pymongo
  Downloading pymongo-3.2-cp27-none-macosx_10_8_intel.whl (263kB)
    100% |████████████████████████████████| 266kB 1.4MB/s 
Installing collected packages: pymongo
Successfully installed pymongo-3.2



References

  1. Python Argparse
    1. The first part of this program uses argparse to read the command-line arguments passed by the user
  2. [Official Documentation] PyMongo Tutorial
    1. This tutorial is intended as an introduction to working with MongoDB and PyMongo
  3. Unix ULIMIT settings
    1. I've noticed the bulk insert with PyMongo has a tendency to run out of memory.  This details a method for limiting and controlling the usage of system resources that might help.
      1. [StackOverflow] PyMongo Bulk Insert Runs out of memory
      2. [MongoDB JIRA] Bug Report (fixed)
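
Separately from the ulimit approach referenced above, another way to keep memory bounded during a bulk insert is to send documents in fixed-size batches instead of one large list. This is a different technique from the one in the linked material, sketched under assumptions: batches, insert_in_batches and FakeCollection are hypothetical names, and the collection object is assumed to offer insert_many as PyMongo 3.0+ collections do.

```python
def batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def insert_in_batches(collection, documents, size=1000):
    """Insert documents batch by batch and return how many were sent.

    `collection` is assumed to offer insert_many(list_of_dicts), as a
    PyMongo 3.0+ collection does.
    """
    total = 0
    for batch in batches(documents, size):
        collection.insert_many(batch)
        total += len(batch)
    return total


class FakeCollection(object):
    """Stand-in for a real collection so the sketch runs without a server."""

    def __init__(self):
        self.calls = []

    def insert_many(self, docs):
        self.calls.append(len(docs))


# Feed a generator so that only one batch is held in memory at a time.
fake = FakeCollection()
sent = insert_in_batches(fake, ({"n": i} for i in range(2500)), size=1000)
print(sent, fake.calls)
```

With a real server, the FakeCollection instance would be replaced by something like db[args.collection].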

8 comments:

  1. Really nice post. Especially the exceptions you are catching in your try/catch; they can be a pain.

  2. I'm getting the following error message and my one json file (myfile.json) isn't importing:

    TypeError: 'unicode' object does not support item assignment

    Replies
    1. It also says the following early in the error stream:

      Value Error: Extra data: line 1 column 9 - line 2 column 1 (char 8 - 20)
      Value Error: Extra data: line 1 column 13 - line 2 column 1 (char 12 - 16)
      Value Error: Extra data: line 1 column 13 - line 2 column 1 (char 12 - 26)
      Value Error: Extra data: line 1 column 11 - line 2 column 1 (char 10 - 24)
      ...etc.

      And keeps going for about 20 lines.

      Not sure what the problem may be.

    2. Figured it out. The section in the read(jsonFiles) function needs to be...

      -----------
      for jsonFile in jsonFiles:
          with open(jsonFile) as f:
              data = f.read()
              jsondata = json.loads(data)
          try:
              db[args.collection].insert(jsondata)
              counter += 1
          etc.
      -----------
      I tried this with importing a single json document and it worked.
