
Importing and exporting data in Python

Updated: Apr 10, 2020


I recently decided to work systematically on my Python skills - and I started with importing and exporting data. I am one of those people who 'picked up coding on the fly'. In other words, I took an introductory class which taught me the basics [a shoutout to the awesome computational linguistics department at HHU #hhu], and then just went from there. In my PhD project, I worked with a bunch of txt- and csv-files. Thus, I learned how to read and write those types of files. I also taught myself how to find certain structures in a text, how to manipulate texts, how to filter and merge data... I learned what I needed and I made it work! Sounds pretty impressive to me...


However, since I left academia I have realized that this 'on the fly' learning also has some pretty negative sides. I have no freaking clue how my skills stack up against others', and if someone asks: 'Can you read and write files?', my answer usually is: 'Well, I guess, yeah. I mean, I have worked mainly with .txt- and .csv-files, but it shouldn't be hard to read and write other files... [ENTER RAMBLING].' Now, the good news is: it really is NOT that hard to read and write other types of files, and I actually picked it up in one weekend.


What I found a bit tricky, though, was finding all the information in one place - I am sure there are great resources out there and I am not reinventing the wheel here. DataCamp, for example, has some pretty neat classes on importing files. But I am a huge fan of comprehensive summaries, and therefore I decided to share my notes on reading and writing different types of files in Python.


Below you will find information on how to import and export the following file types: txt-files, csv-files, Excel files, SQL databases, pickled files, HTML, and JSON files. Note that this is a very practical type of text which serves to give some hands-on tips. In other words, do not expect the most entertaining blog entry here ;-). I hope that you find the descriptions below, as well as the code provided, helpful. Note that I have also uploaded the code and the files to GitHub.

1. Txt-files


The good news is that to read and write txt-files, you do not need to load any additional packages, and it can be done with just a few commands. So let's dive right in. First, we will look at reading files, and then we will go on to writing them.


There are several ways to read files. One option is to open and read the file step by step (see code below). To do this, you first open the txt-file with the open()-function and save it. Make sure that the txt-file is in your working directory. If it is not, you can add the path to the file name so that Python can find the file (depending on your operating system you might have to use '/' or '\' in your path). The argument 'r' in the open()-function specifies that you want to read the file (as opposed to writing it), and the encoding tells Python how to decode the file. If you do not define the encoding correctly you might end up with illegible text (if you want to find out more about encoding, this is a very helpful article).
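As a quick aside: if you want your paths to work on any operating system without worrying about '/' vs '\', Python's built-in pathlib module can build them for you. Here is a minimal sketch - the folder and file names are made up for illustration:

```python
from pathlib import Path

# 'data' and 'notes.txt' are made-up names used only for illustration
path = Path('data') / 'notes.txt'

# pathlib inserts the correct separator for your operating system,
# and gives you handy attributes on top
print(path.name)    # notes.txt
print(path.suffix)  # .txt
```

You can pass such a Path object directly to open() instead of a string.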


After opening the file, you then read it. There are several ways to do this, three of which I am showing you here. The first one is the read()-method, which reads the whole file as one long string. The second option is the readline()-method, which reads one line at a time. The last option is the readlines()-method, which reads the text as a list of lines, i.e. each line is one element in the list. Depending on what you want to do with the content of the txt, and what the content is, one option might be better suited than another. I often read in the txt as a list of lines (readlines()) and then iterate through the list to perform a certain task on each line. It is crucial that you close the file after you are done with it.


# Step by step solution

# Step 1: opening and saving the file as myfile
myfile = open('Blurb and bio.txt', 'r', encoding='cp1252')

# Step 2: reading the file
# the whole text is read as one string, and line breaks are indicated by '\n'
myfile.read()

# note: each read consumes the file, so jump back to the start
# before trying the next method
myfile.seek(0)

# reads one line at a time
myfile.readline()

# the whole text is read as a list of lines
myfile.seek(0)
myfile.readlines()

# iterating through each line of the text and printing it
myfile.seek(0)
for line in myfile:
    print(line)

# Step 3: closing the file
myfile.close()

Instead of opening and closing the file in two separate commands - and maybe forgetting the closing part... - you can also do what I call 'the fast track' (see code below). Here, instead of opening and closing the file, you simply use the syntax with open(filename, 'r', encoding='..') as NAME: followed by whatever it is that you need to do with the opened file. There is no need to close the file; the with-statement does that for you automatically.


# the 'fast track'

with open('Blurb and bio.txt', 'r') as myfile:
    text = myfile.read()
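The 'fast track' also combines nicely with reading line by line. Here is a small self-contained sketch (the file name and content are made up): it first writes a tiny file, then reads it back inside a with-statement:

```python
# write a small demo file first so the example is self-contained
# ('demo.txt' and its content are made up for this sketch)
with open('demo.txt', 'w', encoding='utf8') as f:
    f.write('first line\nsecond line\n')

# the 'fast track' combined with iterating through the lines
with open('demo.txt', 'r', encoding='utf8') as myfile:
    lines = [line.strip() for line in myfile]

print(lines)  # ['first line', 'second line']
```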

Now, let's turn to writing txt-files, which is actually pretty easy once you know how to open them. Below you find the code. It is pretty similar to the one you saw above. First, you open the file, but instead of reading it ('r') you write it ('w'). Afterwards, you can use different methods to write to your file. One method is write(). It simply writes a string to your file. Another possibility is the writelines()-method, which takes a list as its argument and writes each element of the list to the file. To attain a particular format, you can, for example, combine the write-methods with a for-loop and a string formatting method. In the end, you need to close the file (unless you use the 'fast track' as shown above - it works the same way as with reading files).


# open a file and save it as myfile
myfile = open('new_text.txt', 'w', encoding='utf8')

# write some text
myfile.write('hello')

# write all elements of the list to the file
myfile.writelines(['0\n', '1\n', '2\n'])

# combining a for loop (index 0-6) with different writing options
for index in range(7):

    # writes each number to the file followed by a line break
    myfile.writelines([str(index), '\n'])

    # writes each number to the file followed by the provided sentence
    myfile.write("%s what a pretty format" % index)

# don't forget to close the file
myfile.close()

2. CSV-files

Let's move on to csvs. When I was working in academia, csvs were undoubtedly the most frequent type of data I used, and I cannot stress enough how important it is to have a working method for importing, opening, manipulating and exporting csv-files. There is just so much that can go wrong! But let's not concentrate on what can go wrong but rather see how it can go right using Python's csv and pandas packages - two invaluable packages, I must say!


So, I already gave away that you need additional packages for working with csvs. I usually use a mixture of the csv package and pandas. I found pandas especially useful when working with data frames in Python, e.g. adding columns or merging data. When merely opening and closing csvs, the csv package should also suffice.


Below you find the code for importing a csv. First, you need to load the packages. Then you can use the read_csv()-function (pandas) to read a csv. There are a bunch of arguments you can define, but the most important are the name of the file (add the path when it is in a different directory than your script), the type of separator in your csv (this is necessary so that Python knows where the different columns are), the index column (if there is none, you can set index_col to False - Python will then assign a numerical index), and na_values (here you can define which strings in your data frame Python should interpret as missing). I usually check whether my data frame was imported correctly by printing out its head and dimensions. And that is basically it. One simple line and you have imported a data frame from a csv.


# Import pandas and csv

import pandas as pd
import csv

# import and save the df as mycsv

mycsv = pd.read_csv("Intel.csv", sep=";", index_col=False, na_values="Na")

# check whether it worked
print(mycsv.head())
print(mycsv.shape)

After importing a data frame you will probably want to do something with it, e.g. add columns, delete rows, merge two data frames... the opportunities seem endless. I, and probably thousands of others, would recommend pandas for dealing with data frames. But let's not go into more detail here; instead, let's concentrate on writing the data frames.
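Just to give you a taste of those manipulations, here is a minimal sketch - all column names and values are made up for illustration:

```python
import pandas as pd

# two small made-up data frames for illustration
df1 = pd.DataFrame({'id': [1, 2, 3], 'score': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

# add a column computed from an existing one
df1['double_score'] = df1['score'] * 2

# delete a row (here: the row with index label 0)
df1 = df1.drop(index=0)

# merge the two data frames on their shared 'id' column
merged = pd.merge(df1, df2, on='id')
print(merged)
```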


If you already have a data frame, one simple method (to_csv()) can be applied to it to save it as a csv. It is as simple as that! You can define the name of the file, as well as the separator, the encoding, and probably a few other things which I cannot think of at the moment. In the code below I built a df from three lists and then saved it as a csv.


# Writing a data frame

a = [0, 1, 2, 2, 2, 5]
b = ['a', 'b', 'c', 'd', 'f', 'a']
c = ['a', 'b', 'c', 'd', 'f', 'a']

d = list(zip(a, b, c))
header = ['Eins', 'Zwei', 'Drei']
df = pd.DataFrame(d, columns=header)
print(df.head())

# let's save our df

df.to_csv('new_csv.csv', sep=',')


If you need to write a csv from scratch, you can do this using a dictionary, as I have done below. First you open a new csv file (the syntax should look familiar), then you define a list of column names, and then you create a writer object with which you write using different methods, e.g. writeheader() or writerow(). The arguments of the writerow()-method are dictionaries: every key represents a column and the corresponding value represents the cell in this column. A more elegant solution than the one shown below is to use a for loop in which you iterate through a list of dictionaries to write your csv.


# writing a csv from a dict
# (newline='' avoids blank rows between entries on Windows)

with open('names.csv', 'w', newline='') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_name': 'Donald', 'last_name': 'Trump'})
    writer.writerow({'first_name': 'Bernie', 'last_name': 'Sanders'})
    writer.writerow({'first_name': 'Fred', 'last_name': 'Bananas'})
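For completeness, here is the loop-based version I just mentioned - a sketch with the same made-up names, this time kept in a list of dictionaries (the file name is also made up):

```python
import csv

# the loop-based version: keep the rows in a list of dictionaries
rows = [
    {'first_name': 'Donald', 'last_name': 'Trump'},
    {'first_name': 'Bernie', 'last_name': 'Sanders'},
    {'first_name': 'Fred', 'last_name': 'Bananas'},
]

with open('names_loop.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['first_name', 'last_name'])
    writer.writeheader()
    # iterate through the list and write one row per dictionary
    for row in rows:
        writer.writerow(row)
```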
    

3. Excel-files


A lot of people use Excel files, especially people who are not very familiar or comfortable with data analysis or data science. Sometimes - I have especially seen it with my students - people create complex databases consisting of several Excel sheets. If you want to analyze the data found in these sheets you often need to convert the data into an analyzable shape, meaning you basically must restructure the whole thing (in the best-case scenario you do this without giving the creator of the database the feeling that their hours of work organizing the data were wasted). As mentioned above, you can play around with data frames using the pandas package. But to do so, you need to import Excel files into Python - a task which is actually pretty easy.


The pandas package has a function ExcelFile() which takes your file name as the argument and reads in the file as an ExcelFile object. The neat thing about this object is that you can access the different sheets of your Excel file using the sheet_names attribute. Using the parse()-method with a specific sheet name as an argument, you can access the different sheets of your Excel file and save them as data frames. You can see this in the code below, in which I save the sheet 'Tabelle1' as a data frame ('Tabelle_1' - I know, not a very creative name). I then created a new df by just selecting the first ten rows of Tabelle_1.


# use the pandas package

file = 'Daten Íntelligibility.xlsx'
excel_data = pd.ExcelFile(file)

# access the different sheets
print(excel_data.sheet_names)
# ['Tabelle1', 'Tabelle2']

Tabelle_1 = excel_data.parse('Tabelle1')

Short_Tabelle1 = Tabelle_1.head(n=10)

Now, let's see how we can save a data frame as an Excel file. This is sooooo easy! You just use the to_excel()-method on your data frame and pass the desired file name as an argument (see code below). If you want to write an Excel file with several sheets you can use the pandas function ExcelWriter() with the file name as an argument. You can then use the to_excel()-method on the different data frames, passing the writer object and the sheet names as arguments. Well, and that is it! I told you it was pretty easy.


# write new excel with 1 sheet
Short_Tabelle1.to_excel("new_excel.xlsx")

# write new excel with several sheets
with pd.ExcelWriter('new_big_excel.xlsx') as writer:  
    Short_Tabelle1.to_excel(writer, sheet_name='Sheet_name_1')
    Short_Tabelle1.to_excel(writer, sheet_name='Sheet_name_2')

4. SQL-databases


SQL, or as I call it, one of the holy grails of data science, is everywhere. Everyone requires you to know SQL; everyone seems to use it! I have to admit, I am not an expert on SQL, but as I see it, it is really not THAT complicated. SQL databases are relational databases, meaning that you have a number of tables that are linked to each other and that can be searched using SQL (which makes sense, as SQL stands for Structured Query Language). There are a number of different tools with which you can query an SQL database, but the cool thing is: as a Python user you do not really have to care too much about them (at least I do not). The reason is that Python has a package (sqlalchemy) with which you can connect to SQL databases, search them, and extract data frames!


To work with an SQL database, you first need to import the pandas package and the create_engine()-function from the sqlalchemy package. Then you need to create an engine by passing your SQL database to the create_engine()-function. I have to confess that I do not know exactly what the engine object is, BUT the important thing is that it connects you to your SQL database. Once you are connected to your database you can access all the information in your tables! With the method table_names() you can, for example, print out all the table names of your database. Using the very useful pandas function read_sql_query(), you can query the database using SQL syntax. SQL syntax is not too difficult, and there are hundreds of tutorials out there that will teach you the basics. In the code below, I selected all columns from the table 'Player' using SQL syntax. Do not forget to pass your engine object to the read_sql_query()-function! The really cool thing about this function is that it returns a data frame, i.e. something you can manipulate easily using pandas and which you can, as I did below, save as a simple csv (I told you before - I am a big fan of csvs!).


# use pandas as pd and sqlalchemy
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///database.sqlite')
print(engine.table_names())

sql_df = pd.read_sql_query("SELECT * FROM Player", engine)
print(sql_df.head())

# let's just save that as a csv

sql_df.to_csv("new_soccer.csv", sep=",")

I have not yet created any SQL databases myself. However, I will make sure to update this post if I ever feel the urge to create or update an SQL database. For those of you who were searching for exactly how to create an SQL database in Python - I am sorry!!!
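That said, writing a data frame into a brand-new SQLite database is not much harder than reading one. Here is a minimal sketch using pandas' to_sql()-method - the file name, table name, and data are all made up for this example:

```python
import pandas as pd
from sqlalchemy import create_engine

# connecting to a non-existing SQLite file simply creates it
engine = create_engine('sqlite:///new_database.sqlite')

# a tiny made-up data frame, just to have something to store
df = pd.DataFrame({'player': ['A', 'B'], 'goals': [3, 5]})

# write the data frame as a table; replace it if it already exists
df.to_sql('Player', engine, if_exists='replace', index=False)

# read it back to check that it worked
check = pd.read_sql_query("SELECT * FROM Player", engine)
print(check)
```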

5. Pickled files


What the hell are pickled files? That was the first thing that came to my mind when I came across this term. When I found out what they are, it all made sense though. Pickled files are Python-specific non-flat files. This means that these are files that can store complex data structures, such as class objects. It pickles them! What is important to understand is that these files are not 'human-readable'. Try opening a pkl-file with Notepad and you will see what I mean! So, if you want to open a pickled file, you need to do it in Python. You can use the pickle package to do so (a pretty fitting name, isn't it?).


I would like to start with writing a pickled file. In the code below, after importing the pickle package, I define the class 'Fruits'. I then create the object 'banana', as well as two attributes of that object. I am doing this so that I have something to save as a pickled file, i.e. a non-flat structure. To write the pickled file containing the banana object with its attributes, I open a file (you already know this function) and save it as 'filehandler'. Note the mode 'wb': pickled files are binary, so you open them in binary mode. To write the file, I then use the dump()-function, passing the object 'banana' and my filehandler to it. I then close the file. The file is saved. Note that you can of course again use the 'fast track' as described in the txt-section. Also, you do not need to save the filehandler but could pass the open() call directly to the function. Sometimes, I just prefer the step-by-step solutions as they are so much easier to follow.


# importing the package
import pickle

# we define a class and create an object of that class (just for the sake
# of having something to save)

class Fruits: pass

banana = Fruits()

banana.color = 'yellow'
banana.value = 30

# writing a pickled file ('wb' = write binary)
filehandler = open("Fruits.pkl", "wb")
pickle.dump(banana, filehandler)
filehandler.close()
  

Now, let us look at how to open a pickled file. We are again using the pickle package. We open the file using the open()-function (see the txt-section to read about this function - note the binary mode 'rb'), and then [HERE COMES THE CRUCIAL PART] we load the file using the pickle function load(). Do not forget to save the newly loaded object, and of course do not forget to close the file! In the code below, I then check whether everything was loaded correctly by printing out the object's attributes.


Instead of doing it step by step, you can also use the fast track with the with-statement (the code is also provided below). As you can see, things are starting to become repetitive - and that actually is a good thing! Reading and writing different file types really follows the same logic over and over again - well, at least most of the time. We are moving on to HTML, and things are actually a little bit different there... but let's see.


import pickle

# Step by step
# open the file ('rb' = read binary)
file = open("Fruits.pkl", 'rb')
# load the file
object_file = pickle.load(file)
# close the file
file.close()

# checking whether everything worked by printing the object's attributes
print(object_file.color, object_file.value, sep=', ')

# or

# fast-track
with open('Fruits.pkl', 'rb') as file:
    pic_data = pickle.load(file)

# checking whether everything worked by printing the object and its color attribute
print(pic_data)
print(pic_data.color)

6. HTML


HTML code is annoying - well, at least it is for me, who most of the time just wants to scrape clean text from a website. I have reason to believe that there are more people like me who desire to attain 'clean text' from a website. Enter: BeautifulSoup. BeautifulSoup is a Python package with which you can access different parts of HTML code using the tags which are already present in the HTML. I have to confess that I have not mastered all the intricacies of HTML or of BeautifulSoup, but I have found a pretty well-working workflow that allows me to scrape text from a website and bring it into a format that I can work with.


First, I use the requests package to scrape the text from the web. To do so, I import the package, define my URL, and then use the get()-function with the URL as its argument to literally get the data from the web. I save the response as 'r'. With the attribute .text, I then attain a text version of the HTML. However, when you print the HTML text, you will see that it is very messy and has tons of tags in it - pretty much a lot of things we, or at least I, want to get rid of. To do so, I use BeautifulSoup.


# we use the requests package

import requests

# specify the url
url = 'https://en.wikipedia.org/wiki/Giant_panda'

# submit the request
r = requests.get(url)

# get the html text
html_text = r.text

print(html_text)


To understand what BeautifulSoup does, and what I intend to do with it, I need to explain a bit about HTML. HTML consists of a bunch of elements such as the title, a paragraph, or an image. The content of an element is surrounded by tags that tell you more about the element. For example, <title> My bad HTML explanation </title> could be the title element of an HTML document, with <title> being the opening tag and </title> being the closing tag. There are tons of different elements, and often you find additional information about the elements in the tags (attributes). Furthermore, elements can be nested, so that HTML becomes a tree-like structure. BeautifulSoup makes use of HTML's hierarchical structure and its tags to find certain elements in the HTML. So, let's see how it is done.
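To make the tag idea concrete, here is a tiny sketch on a made-up HTML snippet: you can access an element directly via its tag, or collect all elements with a given tag via find_all():

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML snippet for illustration
html = ("<html><head><title>My page</title></head>"
        "<body><p>First paragraph</p><p>Second paragraph</p></body></html>")

soup = BeautifulSoup(html, 'html.parser')

# access an element via its tag
print(soup.title.string)  # My page

# find all elements with the tag 'p' and extract their text
paragraphs = [p.get_text() for p in soup.find_all('p')]
print(paragraphs)  # ['First paragraph', 'Second paragraph']
```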


First, I import BeautifulSoup from bs4. Then I create a BeautifulSoup object by passing the HTML text to the BeautifulSoup()-function. I also pass an HTML parser as an argument to this function; this is not obligatory, and BeautifulSoup will usually tell you something about the parser if you leave it out (yes, I know - not a very sophisticated approach, but it works...). Once you have your soup object, you can have a look at the prettified soup (prettify()-method), which is always interesting as you can really see the tree structure of the HTML.


After creating my soup object, I get rid of the elements I am not interested in, i.e. the script and style elements. I do this by iterating through these elements in my HTML (remember, BeautifulSoup can identify different elements by accessing the tag structure) and deleting them from the soup. Then I use the get_text()-method to extract the plain text from the soup object. After that, I use some string operations to 'clean' the text. First, I use a list comprehension to split the text into lines and remove all leading and trailing white space. Then I drop all blank lines by iterating through the list of lines and checking whether there is actually an element present. If so, I join the elements with a line break, i.e. \n. Thus, all non-empty lines are joined and now form a text which can be worked with, printed out, or saved.


# we need BeautifulSoup to make html more accessible

from bs4 import BeautifulSoup

# make the soup object
soup = BeautifulSoup(html_text, 'html.parser')

# print the prettified soup
print(soup.prettify())

# delete all script and style elements
for script in soup(["script", "style"]):
    # rip it out
    script.extract()

# get the text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = [line.strip() for line in text.splitlines()]

# drop blank lines and join the lines again
text = '\n'.join(element for element in lines if element)

# check whether what I did worked
print(text, type(text))

# store the data as text
panda_file = open("pandas.txt", "w", encoding='utf8')
panda_file.write(text)
panda_file.close()

While the text the code above produces might not be perfect - e.g. it still contains some stray 'n/a's and some Chinese characters - I found the code to be very useful for generating a 'workable' text. BeautifulSoup seems to be a very powerful and useful package. So if you work a lot with HTML, or are planning to do so, I would really look into it!

7. JSON files and APIs


Last but not least, I want to talk about APIs and JSON files. API stands for Application Programming Interface, and as I understand it, an API is a set of routines that helps you connect to and interact with websites. Connecting to a website is necessary when you want to pull data from it. I think of an API as a linking element that translates between different pieces of software. So, when you want to pull data from the web, you often need to 'talk to the API'. Luckily, that is not too hard using Python's requests package (which you already know from above).


Similar to what you do when you pull HTML files from the web, connecting to an API requires you to first import the requests package. Then you define your URL and use the get()-function of the requests package to pull the data. Importantly, when connecting to an API to pull data, you often need to enter an API key (in the code below I just inserted YOUR_KEY where your key is supposed to go). The websites are usually pretty good at explaining how to attain a key and how to use it, i.e. how the syntax of the URL works. By the way, in the example below I connect to the Open Movie Database to extract a JSON file containing information about the movie Harry Potter.


import requests

url = "http://www.omdbapi.com/?t=Harry+Potter&apikey=YOUR_KEY"

r = requests.get(url)

So now that we have pulled the data from the web, we need to load it as a JSON file. JSON stands for JavaScript Object Notation - despite the name, though, JSON is a language-independent data format. The neat thing about JSON files is that their structure resembles the structure of Python dictionaries and is human-readable (unlike pickled files).


To work with JSON in Python, we can use the json package. For data pulled from the web, we can simply use the .json()-method on the response object (this method comes with requests) and that is it! Python loads the JSON as a dictionary. You can thus iterate through its keys and values to explore the file's structure. You can print the dictionary's keys or the values of a key. In other words, just do whatever you do with a dictionary!


import json

# load the json file which is pulled from the web

json_data2 = r.json()

# iterate through keys and values
for key, value in json_data2.items():
    print(key, ": ", value)

# print keys and access the values of keys
print(json_data2.keys())
print(json_data2['Website'])
print(json_data2['Rated'])

To open a JSON file which you have already stored in your directory, you can use the open()-function and combine it with json's load()-function, as is done in the code below.


# opening a stored JSON file

with open('example_2.json') as json_file:
    json_data = json.load(json_file)
  

Now, let's write a JSON file. As we have seen above, JSON files correspond to Python dictionaries. So, we can just take a dictionary and save it as a JSON. In the code below, I first create a dictionary, which I then save as a JSON by opening a new file and using json's dump()-function. And that is it! Not too difficult, if you ask me.


# let's create a dict
small_dict = {'animals': {'birds': ['eagle', 'swan'],
                          'fish': ['clownfish', 'cod', 'salmon']},
              'plants': ['fern', 'mint', 'tulip']}

# check my dictionary
print(small_dict.keys())
print(small_dict['animals']['birds'])

# let's save it as a json
with open("sample.json", "w") as outfile:
    json.dump(small_dict, outfile)
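By default, dump() writes everything on one line. If you want the saved JSON to be easier to read, you can pass the indent argument; json.dumps() works the same way but returns a string instead of writing to a file. A small sketch with a made-up dictionary:

```python
import json

# a small made-up dictionary
data = {'b': 2, 'a': 1}

# indent pretty-prints the JSON, sort_keys orders the keys alphabetically
pretty = json.dumps(data, indent=4, sort_keys=True)
print(pretty)
```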

The end


Finally, let me say that there are many more ways to read and write files in Python. There is an abundance of packages, functions, and methods, and there are probably many more elegant and cool ways to import and export data. So, just google around and keep finding simpler and more efficient ways!




 
 
 



©2020 Sonia Ben Hedia. Created with Wix.com
