10 Python pandas tricks all data scientists should know… 👌

For those of us involved in machine learning and data science, Python is an absolute must. Working with pandas DataFrames is your key to success, and here are some of the most useful tricks I have picked up over the last few years.

Shaun Enslin
4 min read · Oct 16, 2022
Source: unsplash.com

1. Select from table where f1=’a’ and f2=’b’

When selecting a subset of a dataframe, we often need to combine conditions with an AND or an OR. Wrap each condition in its own parentheses and use the ampersand for AND:

dfb = df.loc[(df.Week == week) & (df.Day == day)]

The same can be done for an OR, using the pipe character:

dfb = df.loc[(df.Week == week)|(df.Day == day)]
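As a minimal, self-contained sketch (the Week and Day values here are invented for illustration), both forms look like this:

```python
import pandas as pd

# Toy data standing in for the article's Week/Day columns
df = pd.DataFrame({'Week': [1, 1, 2, 2], 'Day': [0, 1, 0, 1]})

# AND: both conditions must hold; each condition sits in its own parentheses
both = df.loc[(df.Week == 1) & (df.Day == 0)]
# OR: either condition may hold
either = df.loc[(df.Week == 1) | (df.Day == 0)]
```

Note that `and`/`or` won't work here; pandas needs the element-wise `&` and `|` operators.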

2. Select where in

A great feature is being able to select rows from a pandas dataframe where a column's value appears in another dataframe, series or list, i.e. the SQL equivalent of:

select * from table1 where field1 in (select field1 from table2)

Let's take a nice simple example: we have a list called days containing the values 0, 1 and 2.

Now, if we have a dataframe df with a CallCycleDay field, we can select the matching rows with isin:

days = [0,1,2]
df[df.CallCycleDay.isin(days)]
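A runnable sketch of the same idea, with invented CallCycleDay values:

```python
import pandas as pd

df = pd.DataFrame({'CallCycleDay': [0, 1, 2, 3, 4]})
days = [0, 1, 2]

# isin builds a boolean mask: True wherever CallCycleDay appears in days
matched = df[df.CallCycleDay.isin(days)]
```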

3. Select where not in

Just as we want to select IN, we also want to select NOT IN. Using the tables above, let's do the same, this time negating the mask with the tilde (~) operator:

days = [0,1,2]
df[~df.CallCycleDay.isin(days)]
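Using the same invented CallCycleDay values as before, the tilde flips the mask:

```python
import pandas as pd

df = pd.DataFrame({'CallCycleDay': [0, 1, 2, 3, 4]})
days = [0, 1, 2]

# ~ negates the isin mask, i.e. SQL's NOT IN
remaining = df[~df.CallCycleDay.isin(days)]
```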

4. select sum(*) from table group by

Dataframes support SQL-style group bys, so learn these commands for quick summaries of your data. It doesn't get easier:

df.groupby(by=['RepID','Week','CallCycleDay']).sum()

If you do want to save your results, or use them later and reference the fields, then add as_index=False:

rstot.groupby(by=['RepID','Week','CallCycleDay'], as_index=False).sum().to_csv('my file.csv')
Using as_index=False gives you the results in table form, with columns you can reference.
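A self-contained sketch, with invented rep-call data matching the article's column names plus a made-up Calls measure:

```python
import pandas as pd

rstot = pd.DataFrame({
    'RepID': [1, 1, 2],
    'Week': [1, 1, 1],
    'CallCycleDay': [0, 0, 1],
    'Calls': [3, 2, 5],
})

# as_index=False keeps RepID/Week/CallCycleDay as ordinary columns
totals = rstot.groupby(by=['RepID', 'Week', 'CallCycleDay'], as_index=False).sum()
```

Here the two rows for RepID 1 collapse into one, with Calls summed to 5.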

5. Update fields in table 1 from table2

This is useful when you have made a copy of a dataframe, changed some values, and now want to update the original dataframe. In the example below, we make a copy of part of our dataframe and change its values; we then want to write the values in field2 back to the original dataframe.

NB: update uses the index to join.

dfb = dfa[dfa.field1=='somevalue'].copy()
dfb['field2'] = 'somevalue'
dfa.update(dfb)
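A runnable sketch with invented field1/field2 values, showing the index-based join in action:

```python
import pandas as pd

dfa = pd.DataFrame({'field1': ['a', 'b', 'a'], 'field2': [1, 2, 3]})

# The copy keeps dfa's index labels (0 and 2), which update joins on
dfb = dfa[dfa.field1 == 'a'].copy()
dfb['field2'] = 99

dfa.update(dfb)  # only rows 0 and 2 are overwritten
```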

6. Create a new field using apply/lambda

You may want to iterate through the dataframe row by row and apply some logic to each row. Below is an example where we create a new field called address, which is a concatenation of a few fields.

dfa['address'] = dfa.apply(lambda row: row['StreetName'] + ', ' +
                           row['Suburb'] + ', ' + str(row['PostalCode']), axis=1)
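With invented address fields, the same pattern runs end to end:

```python
import pandas as pd

dfa = pd.DataFrame({'StreetName': ['1 Main Rd'],
                    'Suburb': ['Claremont'],
                    'PostalCode': [7708]})

# axis=1 hands each row (a Series) to the lambda
dfa['address'] = dfa.apply(lambda row: row['StreetName'] + ', ' +
                           row['Suburb'] + ', ' + str(row['PostalCode']), axis=1)
```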

7. Insert a new row

Without a doubt the best way to insert a row is to use concat. The one trick is that your new row needs to be converted to a dataframe first. Have no fear: pd.DataFrame.from_records does the job for us.

# row is an existing row (a Series) and x a counter from an enclosing loop
newRow = row.copy()
newRow.CustomerID = str(newRow.CustomerID)+'-'+str(x)
newRow.duplicate = True
df = pd.concat([df,pd.DataFrame.from_records([newRow])])
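A self-contained sketch with a single invented customer, taking one existing row, tweaking it and concatenating it back on:

```python
import pandas as pd

df = pd.DataFrame({'CustomerID': ['C1'], 'duplicate': [False]})

# Take an existing row as a Series, tweak it, then concat it back on
row = df.iloc[0]
newRow = row.copy()
newRow.CustomerID = str(newRow.CustomerID) + '-1'
newRow.duplicate = True
df = pd.concat([df, pd.DataFrame.from_records([newRow])], ignore_index=True)
```

ignore_index=True renumbers the index so the new row doesn't duplicate an existing label.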

8. Changing the column type

At times you may have a field which you would prefer to be a string, but when read from Excel or CSV it comes in as a number. You can use the astype function to quickly change it to a string.

df = pd.read_excel('customers_.xlsx')
df['Longitude'] = df['Longitude'].astype(str)
df['Latitude'] = df['Latitude'].astype(str)
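The same cast on an in-memory frame (coordinates invented, so no Excel file is needed):

```python
import pandas as pd

# Numeric until cast
df = pd.DataFrame({'Longitude': [18.42], 'Latitude': [-33.92]})
df['Longitude'] = df['Longitude'].astype(str)
df['Latitude'] = df['Latitude'].astype(str)
```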

9. Dropping columns

Yes, our dataframes do tend to get messy, more so when using group bys and then saving the results. Before you save your dataframe, clean up the columns using the drop function, as below.

def cleanColumns(df):
    for col in df.columns:
        if col[0:7] == "Unnamed":
            df.drop(col, inplace=True, axis=1)
    return df
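A quick self-contained check (repeating the helper so the sketch runs on its own, with an invented artefact column):

```python
import pandas as pd

def cleanColumns(df):
    # Drop artefact columns such as "Unnamed: 0" left behind by round-tripping through CSV
    for col in df.columns:
        if col[0:7] == "Unnamed":
            df.drop(col, inplace=True, axis=1)
    return df

df = cleanColumns(pd.DataFrame({'Unnamed: 0': [0, 1], 'RepID': [1, 2]}))
```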

10. Plotting geo results to a map

If you are in any way working with location data, you need to plot the results. Below is a good example where we plot customer locations, along with the centroid position for each cluster of customers.

import folium
import numpy as np
import pandas as pd

# centroidFile and dfm (the customer dataframe) are defined earlier
df_clustercentroids = pd.read_csv(centroidFile)
# Get our list of unique clusters
lst_elements = sorted(list(dfm.cluster2.unique()))
# Now make up random colors for each unique cluster
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF) for i in range(len(lst_elements))]
# Add a color column to the dataframe and apply the colors
dfm["color"] = dfm["cluster2"].apply(lambda x: lst_colors[lst_elements.index(x)])
# Plot each customer as a small circle, colored by cluster
m = folium.Map(location=[dfm.iloc[0].Latitude, dfm.iloc[0].Longitude], zoom_start=9)
for index, row in dfm.iterrows():
    folium.CircleMarker(location=[float(row['Latitude']), float(row['Longitude'])],
                        radius=4, popup=str(row['RepID']) + '|' + str(row.CustomerID),
                        color=row['color'], fill=True,
                        fill_color=row['color']).add_to(m)
# Plot centroids, labelled with the customer count per cluster
for index, row in df_clustercentroids.iterrows():
    n = str(dfm.loc[dfm.cluster2 == index, 'CustomerID'].count())
    folium.Marker(location=[float(row['Latitude']), float(row['Longitude'])],
                  popup=str(index) + '|#=' + n,
                  icon=folium.Icon(color='black', icon_color=lst_colors[index]),
                  tooltip=str(index) + '|#=' + n).add_to(m)
m

Below is the result of color coding each cluster and using a pin for the centroid position of each cluster.
